I. INTRODUCTION


One of the most important tools for analyzing economic policies
is the micro-analytic model. This technique is used frequently in
public decision-making centers. Virtually every Federal Agency uses
micro-analytic models for the evaluation of policy proposals.

Direct use of sample observations rather than aggregated data
is characteristic of the micro-analytic approach. For this reason,
the micro-data used as input to the model have a significant
bearing on the validity of the results of the model. Furthermore,
when all the input data come from a single sample, the quality of the
model depends on, among other things, sampling and data recording
procedures. However, if the data from a single source are insufficient
or partly aggregated, then typically multiple sources of data are used
to provide the necessary input to the model. In that case,
issues such as validity and quality of the results of the model
cannot be assessed as easily as when we have a single source of data
as input. In such situations, government statisticians have been
using a methodology in which multiple sources of data are merged to
form a composite data file. Effective use of the different pieces of
data in order to produce sensible but more comprehensive files is a
fundamental issue in the file merging methodology.


Some of the difficulties associated with the merging procedures
and techniques for their resolution have been known for quite some
time. Initiated by the Federal Subcommittee on Matching Techniques,
there has recently been a renewed effort to establish a solid theoretical
foundation and empirical justification for the file merging
methodology. This research reviews the relevant literature and then
presents new statistical properties of some known procedures for merging
data files. We shall now give an example of a typical situation in
which merging of two files is carried out.


1.1 A Paradigm

A micro-economic model in heavy use at the Office of Tax
Analysis (OTA), Department of the Treasury, is the Federal Personal
Income Tax Model. This model is used to assess proposed tax law
changes in terms of their effects on the distribution of after-tax
income, the efficiency with which the changes will operate in
achieving their objectives, etc. The inputs for this model are two
sources of micro-data, namely the Statistics of Income File (SOI)
and the Current Population Survey (CPS). The SOI file is generated
annually by the Internal Revenue Service (IRS) and it consists of
personal tax return data. The CPS file is produced monthly by the
Bureau of the Census. As we will explain in Section 1.2, such
pooling of data from more than one Federal Agency has been severely
restricted in recent years by, among other things, confidentiality
issues such as the privacy of the individuals involved in the
aforementioned files of data. For this reason, complete information,
especially identifiers such as social security numbers, is typically not
released by the IRS and the Census Bureau. The resulting micro-data
files are compromises between complete Census files and fully
aggregated data sets. Thus, sufficient detail remains to support
micro-analysis of the population, while partial aggregation protects
individual privacy and greatly diminishes the computational burden.

A typical problem in tax-policy evaluation occurs when no single
available data file such as SOI or CPS contains all the information
needed for an analysis. For example, consider the variables
$W = (X, Y, Z_1, Z_2)$, where

$X$ = Allowable itemizations and capital gains
$Y$ = Old Age Survivors Disability Insurance (OASDI) benefits
$Z_1$ = Social security number
$Z_2$ = Marital status

Suppose that we are interested in estimating a simple correlation
$\rho$ between $X$ and $Y$ or, more generally, the expectation of a known
function $g$, say, of $W$; that is, the integral

$$\gamma = \int g(w)\, dF(w) \tag{1.1.1}$$

where $F(w)$ is the joint distribution function of the variables in $W$.
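As a point of reference for what follows, here is a minimal Python sketch of how the quantity in (1.1.1) would be estimated if complete records on W were available: the integral is replaced by a sample average of g. The function name `estimate_gamma`, the toy records and the choice of g are illustrative assumptions only.

```python
def estimate_gamma(sample, g):
    """Sample-mean estimate of E[g(W)] from complete records on W."""
    values = [g(w) for w in sample]
    return sum(values) / len(values)


# Hypothetical complete records (x, y); such joint data are exactly
# what is NOT available when X and Y sit in different files.
sample = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
gamma_hat = estimate_gamma(sample, lambda w: w[0] * w[1])
```

The difficulty discussed in this section is that no single file supplies the complete records assumed by this sketch.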
Now, the SOI microdata file cannot be used in its original form since
it does not include the OASDI benefits ($Y$). Census files (CPS) with
OASDI benefits do not allow a complete analysis of the effect of
including this benefit, since they do not contain information on
allowable itemizations and capital gains ($X$). Thus, instead of
observing $X, Y, Z_1, Z_2$ jointly on the same units, we get only
the following pair of files:

File 1 (SOI): $X, Z_1, Z_2$

and

File 2 (CPS): $Y, Z_1, Z_2$

Estimating $\gamma$ in (1.1.1) based on the fragmentary data provided by
File 1 and File 2 is an important practical problem that has not yet
been solved satisfactorily. In an attempt to cope with situations such
as the OTA model, Federal Agencies have long been using procedures for
matching or merging the two incomplete files so that one can do the
usual inference for $\gamma$, hoping that the merged file is a reasonable
substitute for the unobserved data on $(X, Y, Z_1, Z_2)$.

The reporting units in CPS are households. In general, the
units in a file may refer to other types of legal persons, like
corporations, partnerships and fiduciaries. The term "individual"
will be used as a generic label in this thesis to refer to the
reporting units of the micro-data files.

1.2 A Dichotomy of Matching Problems

Roughly speaking, there are two different categories of matching
problem. The first category consists of problems of exact matching,
in which it is desired to identify pairs of records in the two files
that pertain to the same individual. Accurate information on
identifiers such as social security number, name and address is
assumed to be available when exact-matching the two files. It is clear
that all we need to carry out an exact match of two files is, among
other tools, efficient software to sort the individuals by their
identifiers. With the help of such software, we can, within reasonable
error, link a given individual in File 1 with an individual in File 2
such that these two units possess the same values for the identifiers.
The resulting merged file contains data which are more comprehensive
than those of either File 1 or File 2. Also, even after merging, most
records will pertain to the same individual, the number of erroneous
matches in the enlarged file depending on the particular software used
in the process of merging. It is clear that, if accurate identifiers
are available for the units in the two files, then no statistical issues
are involved in the matching methodology and we shall not discuss
this type of problem any further. However, one may refer to, among
others, Fellegi and Sunter (1969) and Radner et al. (1980) for work
related to the exact matching methodology. We shall close our
discussion of this type of matching problem by noting some of the
reasons why exact matching of files is often not possible.
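The exact-matching idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the identifier field name `ssn` and the function name `exact_match` are hypothetical, identifiers are assumed unique within File 2, and a hash lookup stands in for the sort-based linkage mentioned in the text.

```python
def exact_match(file1, file2, key="ssn"):
    """Link records of file1 and file2 that carry the same identifier.

    Both files are lists of dicts; `key` names the (hypothetical)
    identifier field. Returns the merged records and the File 1
    records for which no partner was found.
    """
    index = {rec[key]: rec for rec in file2}  # identifier -> File 2 record
    merged, unmatched = [], []
    for rec in file1:
        partner = index.get(rec[key])
        if partner is None:
            unmatched.append(rec)
        else:
            merged.append({**rec, **partner})  # union of the two records' fields
    return merged, unmatched


file1 = [{"ssn": "A1", "x": 10.0}, {"ssn": "B2", "x": 20.0}]
file2 = [{"ssn": "B2", "y": 5.0}, {"ssn": "C3", "y": 7.0}]
merged, unmatched = exact_match(file1, file2)
```

When identifiers are withheld or unreliable, a lookup of this kind is no longer possible, which is the situation taken up next.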

First, over the past several years, there have been significant
changes in the laws and regulations pertinent to exact matching of
records for statistical and research purposes. New laws, especially
the Privacy Act of 1974 and the Tax Reform Act of 1976, have imposed
additional restrictions on the matching of records belonging to more
than one Federal Agency and on the matching of files of Federal
Agencies with those of other organizations. As a result of these
laws, some Agencies have limited access to their records for
statistical purposes to an even greater extent than seems necessary
under statutory requirements.

Second, analyses of microdata often involve data on units that
are not available from a single source but are available from several
sources. For example, suppose that one is interested in the
relationships between two sets of variables, one set consisting of
information about health care expenses incurred by individuals and the
other set consisting of information about receipt of various types of
welfare benefits. Suppose further that no existing data file contains
all of the needed variables, but that two samples of a target
population, which come from two different surveys, together contain all
these variables. If executing a new survey to obtain all the variables
from a single sample is not feasible, then one might match the two
samples and use the merged file for statistical analyses of variables
which are not present in the same sample. Note that the two sample
surveys may have information on the same individuals whose
identities are either unknown or unreliable. However, in the
aforementioned example, it is more appropriate to assume that the two
samples contain very few or no individuals in common. In case the
two samples are stochastically independent, we shall describe the
units in the two samples as similar individuals.

Suppose, then, that exact matching is not feasible in view of
the aforementioned reasons. Then the tools that are used in the
exact matching methodology are inadequate for the purpose of merging
the two files of data. In particular, identifiers are practically
useless. However, the probabilistic structure of the populations
that generate the data in the two files or other statistical
techniques can often be used to combine the two files. Such
procedures will be called statistical matching strategies.

In the literature on matching files there is no consensus on
rigid definitions of Exact Match and Statistical Match. Indeed, it
is traditional to distinguish these two types of problem by
verifying whether the same (exact) or similar (statistical) individuals
are in the two files. Our classification of matching problems is
somewhat different from the usual practice in the sense that any
procedure for merging files, which may contain the same or similar
individuals, will be described as a statistical match if statistical
techniques are involved in the process of merging. This convention is
in agreement with that of Woodbury (1983), who describes certain
matching problems involving the same individuals in two files as
"Statistical Record Matching for Files".

1.3 A General Set-up for Statistical Matching

Consider a universe $\mathcal{U}$ of individuals. Let $X$, $Y$, $Z$ denote three
groups of random variables and let us assume that we cannot observe
the vector $W = (X, Y, Z)$ for any unit in $\mathcal{U}$. However, suppose that the
following data are available:

(Base) File 1: $n_1$ individuals, each with information on a
function $W^{*}$, say, of $W$,

and (Supplementary) File 2: $n_2$ individuals, each with information
on a function $W^{**}$, say, of $W$.

Various matching problems arise depending on what type of data are in
$W^{*}$ and $W^{**}$. We distinguish only three different situations:

Case I: $W^{*} = X$ and $W^{**} = Y$; we also assume that the two files
contain the same individuals.

Case II: Let $W^{*} = (X, Z)$, $W^{**} = (Y, Z)$. As in Case I, we further
assume that the two files contain the same individuals.

Case III: Let $W^{*} = (X, Z)$, $W^{**} = (Y, Z)$. Unlike in Cases I and II, we
assume that the two files contain similar individuals.


1.4 The Matching Methodology: Some Important Steps

We shall now mention some steps involved in actually creating a
statistical match between two given files. First, if the populations
represented by the files differ, a "universe adjustment" is carried
out to ensure that there is a common universe $\mathcal{U}$ from which the
individuals of the two files are sampled. Second, a "units adjustment"
might be needed if the units of observation in the two files differ
(e.g. persons and tax units). Third, "matching or common variables,"
$Z$, are defined and it is assumed that File 1 with $n_1$ records carries
information on $(X, Z)$, whereas File 2 with $n_2$ records consists of data
on $(Y, Z)$. The variables $X$ and $Y$ are often called non-matching
variables. Finally, in the "merging" step, if the records $(X_i, Z_i)$
and $(Y_j, Z_j)$, respectively from File 1 and File 2, are to be matched,
then one completes the $i$th record in File 1 by substituting $Y_j$ for
the missing value. Thus, we get the synthetic File 1:

$$(X_i, Y_j, Z_i), \quad i = 1, \ldots, n_1,$$

where $j = j(i)$ indexes the File 2 record matched to record $i$.

Clearly, the same methodology can be used to get a synthetic File 2
by finding substitutes for the missing X values of File 2 using X's from
File 1. However, in order to keep our discussion simple, we shall
often be concerned with completing only File 1. Although many
different methods have been used in this final step, several basic
similarities can be identified. In most matches, certain Z variables
are treated as the so-called "cohort" variables. Such variables
establish "packets" of the records in each of the two files, with
matching permitted only between pairs of cases in the same packet.
For example, sex is often a cohort variable so that a male can be
matched with another male, and a female with another female. This
step of forming cells or packets is aimed at diffusing
the dissimilarities between units that are being matched.
Furthermore, depending on how many of the common variables are used as
cohort variables, there may be very little or no within-packet
variation with regard to Z. In such situations, File 1 has data on
X and File 2 has data on Y and we would like to merge the files to
get joint information on X and Y. Note that, in Section 1.3, such a
scenario was labeled Case I. The selection of "matching records"
within a packet is typically based on a "measure of dissimilarity" by
which a "distance" is computed between a given File 1 record and each
potential match in the supplementary file. A potential match with
the smallest distance is chosen as the match that will provide the
missing Y value to a File 1 record.
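The packet-then-distance procedure just described can be sketched as follows. This is a minimal illustration, not any agency's production method: the field names (`sex`, `age`, `x`, `y`), the function name `match_within_packets`, and the choice of absolute age difference as the dissimilarity measure are all assumptions made for the example.

```python
from collections import defaultdict


def match_within_packets(file1, file2, cohort, distance):
    """Complete each File 1 record with the Y value of the closest
    File 2 record belonging to the same packet (cohort cell)."""
    packets = defaultdict(list)
    for rec in file2:
        packets[cohort(rec)].append(rec)  # group File 2 by cohort value

    synthetic = []
    for rec in file1:
        candidates = packets.get(cohort(rec), [])
        if not candidates:
            continue  # no potential match in this packet
        best = min(candidates, key=lambda cand: distance(rec, cand))
        synthetic.append({**rec, "y": best["y"]})
    return synthetic


file1 = [{"sex": "M", "age": 40, "x": 1.0}, {"sex": "F", "age": 30, "x": 2.0}]
file2 = [{"sex": "M", "age": 25, "y": 9.0},
         {"sex": "M", "age": 42, "y": 8.0},
         {"sex": "F", "age": 31, "y": 7.0}]
synthetic = match_within_packets(
    file1, file2,
    cohort=lambda r: r["sex"],
    distance=lambda a, b: abs(a["age"] - b["age"]),
)
```

In this sketch a File 2 record may be used more than once, and a File 1 record whose packet is empty in File 2 is simply left out of the synthetic file; both are simplifying choices of the example.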


1.5 Two Basic Types of Matching Strategies
Suppose that the age of an individual, $Z_1$, say, is a matching
variable. Then, one may define a distance measure $d$, say, between
individuals $i$ in File 1 and $j$ in File 2 by the equation

$$d_{ij} = |Z_{1i} - Z_{1j}|. \tag{1.5.1}$$

For fixed $i = 1, 2, \ldots, n_1$, one will then match one possible $j^{*}$ in
File 2 with the $i$th record in File 1 if $j^{*}$ minimizes $d_{ij}$ over $j$. That
is, $j^{*}$ depends possibly on $i$ and satisfies the restriction

$$d_{ij^{*}} = \min_{1 \le j \le n_2} d_{ij}. \tag{1.5.2}$$


If the choice of $j^{*}$ involves no other restrictions, then the
statistical matching strategy is called "Unconstrained Matching". However,
there are typically additional restrictions subject to which one must
choose the optimal match $j^{*}$ from File 2. Matching data files with
the restriction that the variance-covariance matrix of data items in
each file be identical to the variance-covariance matrix of the same
data items in the matched file is an example of a "Constrained Match".

In order to formulate this type of merging mathematically,
assume first, for simplicity, that both files carry only $n$ records;
that is, the common value of $n_1$ and $n_2$ is $n$. Let

$$a_{ij} = \begin{cases} 1, & \text{if the $i$th record in File 1 is matched with the $j$th record in File 2,} \\ 0, & \text{if the $i$th record in File 1 is not matched with the $j$th record in File 2,} \end{cases} \qquad 1 \le i, j \le n. \tag{1.5.3}$$

Then, the following additional conditions will ensure that the
aforementioned preservation of moments is achieved by not allowing
more than one record in File 1 to be matched with the same record in
File 2:

$$\sum_{i=1}^{n} a_{ij} = 1, \quad \text{for } j = 1, 2, \ldots, n, \tag{1.5.4}$$

$$\sum_{j=1}^{n} a_{ij} = 1, \quad \text{for } i = 1, 2, \ldots, n. \tag{1.5.5}$$

Now let $d_{ij}$ denote, as in the case of an unconstrained match, a
measure of inter-record dissimilarity given by the extent to which
the attributes in any one record differ from the same attributes in
another record. Then the optimal constrained match minimizes the
"objective function"

$$\sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}\, a_{ij} \tag{1.5.6}$$

subject to the restrictions in (1.5.3) to (1.5.5). Clearly, this
extremal problem is the standard linear assignment problem in
"Optimization."

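To make the contrast between the two strategies concrete, the following Python sketch computes an unconstrained match as in (1.5.2) and a constrained one-to-one match minimizing (1.5.6) for two small files of equal size. It is a sketch under assumptions: the distance is the absolute difference of a single matching variable, the function names are hypothetical, and the constrained match is found by brute-force enumeration of permutations, which is practical only for very small n (a dedicated assignment algorithm would be used in practice).

```python
from itertools import permutations


def unconstrained_match(z1, z2):
    """For each File 1 record i, the index j* minimizing |z1[i] - z2[j]|."""
    return [min(range(len(z2)), key=lambda j: abs(zi - z2[j])) for zi in z1]


def constrained_match(z1, z2):
    """Brute-force solution of the assignment problem for equal-size files.

    A permutation perm with perm[i] = j corresponds to a_ij = 1, so every
    candidate automatically satisfies the row and column constraints.
    """
    n = len(z1)
    if len(z2) != n:
        raise ValueError("this sketch assumes files of equal size")

    def cost(perm):
        return sum(abs(z1[i] - z2[perm[i]]) for i in range(n))

    best = min(permutations(range(n)), key=cost)
    return list(best), cost(best)


z1 = [30, 31, 60]   # matching variable (e.g. age) in File 1
z2 = [29, 45, 61]   # matching variable in File 2
free = unconstrained_match(z1, z2)
one_to_one, total_distance = constrained_match(z1, z2)
```

In this toy example the unconstrained match uses the first File 2 record twice and never uses the second, whereas the constrained match uses every File 2 record exactly once at the price of a larger total distance.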
A matching situation more typical of problems relating to policy
analyses is a constrained merge of two files with variable weights
in both files and an unequal number of records in the files.
Let $\alpha_i$ be the weight of the $i$th record in File 1, and let $\beta_j$ be the
weight of the $j$th record in File 2. If $n_1$, $n_2$ are, respectively, the
numbers of records in File 1 and File 2, then we minimize the objective
function in (1.5.6) subject to the following constraints:

$$\sum_{j=1}^{n_2} a_{ij} = \alpha_i, \quad i = 1, 2, \ldots, n_1, \tag{1.5.7}$$

$$\sum_{i=1}^{n_1} a_{ij} = \beta_j, \quad j = 1, 2, \ldots, n_2, \tag{1.5.8}$$

$$\sum_{i=1}^{n_1} \alpha_i = \sum_{j=1}^{n_2} \beta_j, \tag{1.5.9}$$

and

$$a_{ij} \ge 0, \quad \forall\, i \text{ and } j. \tag{1.5.10}$$


It is clear that an optimal constrained matching strategy when
the two files have unequal numbers of individuals is the solution of
a standard transportation problem in which the roles of the
"warehouses" and "markets" are respectively played by the records in
File 1 and File 2 and the "cost of transportation" is the inter-record
distance "$d_{ij}$". Existing algorithms to solve a linear assignment or
transportation problem can be used to complete the final "merge"
step, giving us the synthetic sample

$$W_i^{*} = (X_i, Y_i^{*}, Z_i), \quad 1 \le i \le n_1, \tag{1.5.11}$$

where $Y_i^{*}$ denotes the value of $Y$ assigned to the $i$th record of File 1.

The sample in (1.5.11) may now be used to estimate a parameter like
$\gamma$ in (1.1.1).
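For the weighted case, a general transportation or linear programming solver is needed in general. In the special case of a single matching variable with absolute-difference distance, however, pairing the records in sorted order (the "north-west corner" rule applied after sorting) is known to give an optimal transportation plan. The Python sketch below covers only that special case; the function name `monotone_transport` and the toy weights are illustrative assumptions.

```python
def monotone_transport(z1, alpha, z2, beta):
    """North-west corner rule on records sorted by one matching variable.

    Returns {(i, j): flow}, where flow plays the role of a_ij, so that
    the flows out of record i sum to alpha[i] and the flows into
    record j sum to beta[j].
    """
    if abs(sum(alpha) - sum(beta)) > 1e-9:
        raise ValueError("total weights of the two files must agree")
    order1 = sorted(range(len(z1)), key=lambda i: z1[i])
    order2 = sorted(range(len(z2)), key=lambda j: z2[j])
    rem1 = [alpha[i] for i in order1]  # weight still to be shipped
    rem2 = [beta[j] for j in order2]   # weight still to be received
    flows = {}
    p = q = 0
    while p < len(rem1) and q < len(rem2):
        moved = min(rem1[p], rem2[q])
        if moved > 0:
            pair = (order1[p], order2[q])
            flows[pair] = flows.get(pair, 0) + moved
        rem1[p] -= moved
        rem2[q] -= moved
        if rem1[p] <= 1e-12:
            p += 1          # File 1 record exhausted
        elif rem2[q] <= 1e-12:
            q += 1          # File 2 record exhausted
    return flows


z1, alpha = [50, 30], [2, 1]   # File 1: matching variable and weights
z2, beta = [32, 48], [2, 1]    # File 2: matching variable and weights
flows = monotone_transport(z1, alpha, z2, beta)
cost = sum(abs(z1[i] - z2[j]) * f for (i, j), f in flows.items())
```

Note that a File 1 record may be split across several File 2 records, so the synthetic file can contain more records than File 1, each carrying part of the original weight.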

1.6 Criticisms of Statistical Matching
In Sections 1.4 and 1.5, we described the general form of most
matching techniques that have been used by Federal Agencies.

Matching records at the "packet" level amounts to assuming that the
random vectors X and Y are stochastically independent, given the
value of the common variables Z. In the particular case of a
multivariate normal distribution for W = (X, Y, Z), the conditional
independence assumption is equivalent to the claim that the partial
correlations among X and Y variables, controlling on the Z variables,
are all zero. This point was made first by Sims (1972) and repeatedly by
others since then. The conditional independence assumption is a
strong one for which convincing justification has generally not been
offered. It implies that the relationships between X and Y can be
totally inferred from X's relationship to Z and Y's relationship to Z.
Sims (1978) stated that matching the files under such assumptions is
unnecessary. He also sketched an alternative statistical procedure
that uses the data in the two files to estimate, under conditional
independence, a parameter such as $\gamma$ in (1.1.1). Sims' alternative
will be discussed further in Section 3.2.
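The force of the conditional independence assumption can be seen in the simplest setting of three scalar variables. The partial correlation of X and Y given Z is $(\rho_{XY} - \rho_{XZ}\rho_{YZ}) / \sqrt{(1 - \rho_{XZ}^2)(1 - \rho_{YZ}^2)}$, so setting it to zero forces $\rho_{XY} = \rho_{XZ}\rho_{YZ}$. The short Python sketch below, with hypothetical function names and arbitrary illustrative correlations, checks this numerically.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation of X and Y controlling for a scalar Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def implied_corr_under_ci(r_xz, r_yz):
    """Correlation of X and Y implied by a zero partial correlation."""
    return r_xz * r_yz


r_xz, r_yz = 0.6, 0.5                      # illustrative values only
r_xy = implied_corr_under_ci(r_xz, r_yz)   # correlation forced by the assumption
check = partial_corr(r_xy, r_xz, r_yz)     # should vanish
```

In other words, under the assumption the two files determine the X-Y correlation without any matching at all, which is the sense in which the relationship between X and Y is "totally inferred" from their relationships to Z.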

Fellegi (1978) and many other investigators have expressed great
caution about the use of statistical matching because not much is
known about the accuracy of the estimates of the joint distribution
of W produced by synthetic files.

Notwithstanding these criticisms of statistical matching, there
is no viable alternative statistical procedure that will, in general,
provide better estimates of $\gamma$ in (1.1.1) than a synthetic file can
offer.

Given this lack of good alternatives, especially when conditional
independence does not hold, the area of statistical matching is wide
open and both theoretical and empirical investigations to discover
the properties of synthetic data files are in order.

1.7 Reliability of Synthetic Files

The precision of synthetic-file-based estimators of a given
parameter relevant to the population of W = (X, Y, Z) is affected by
various types of errors that occur while matching two files. To
discuss these matching errors, let us first restrict our attention
to the cases where the same individuals are in the two files, namely
Case I and Case II.

In practice, it is almost inevitable in most matching projects
that some matching errors occur, even with the most sophisticated
procedure and the most careful execution of matching of the files.
These errors fall into two major categories:

(i) Erroneous match (false match), or linking of records that
correspond to different individuals.

(ii) Erroneous non-match (false non-match), or failure to link the
records that do correspond to the same individual.

The reliability of the results of a statistical matching
strategy is often defined (Radner et al., 1980, p. 13) as one of the
following coefficients:

(a) the proportion of correct matches, that is, matches of
records on the same individuals;

(b) the proportion of erroneous decisions, that is, false matches
and erroneous non-matches.

These reliability coefficients are random variables because, in
view of the terminological conventions of Section 1.2, a statistical
matching strategy is dependent on the data in the two files. The
sampling distributions of the reliability coefficients, either exact
or asymptotic (as the sizes of the files grow), are very useful in
judging the quality of a given matching procedure.
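As a small illustration of coefficient (a), suppose the true pairing were known, as it is in a simulation study. The Python sketch below computes the proportion of correct matches; the function name `proportion_correct` and the representation of a match as a list of File 2 indices are assumptions of the example.

```python
def proportion_correct(true_partner, assigned_partner):
    """Proportion of File 1 records linked to their true File 2 partner.

    true_partner[i] is the File 2 index that really belongs to File 1
    record i; assigned_partner[i] is the index chosen by the matching
    strategy.
    """
    if len(true_partner) != len(assigned_partner):
        raise ValueError("both lists must cover the same File 1 records")
    correct = sum(1 for t, a in zip(true_partner, assigned_partner) if t == a)
    return correct / len(true_partner)


true_partner = [2, 0, 1, 3]
assigned_partner = [2, 1, 0, 3]   # output of some hypothetical strategy
reliability = proportion_correct(true_partner, assigned_partner)
```

Because the assigned partners depend on the observed data, repeating the computation over many simulated pairs of files would trace out the sampling distribution of this coefficient.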

Now, we will discuss the reliability of a synthetic file in
Case III, where the two files contain very few or no overlapping
individuals. First, note that the definitions of error in the
results of matching, which have been proposed for Case I, are not
applicable to Case III because the linkage of records from the two
files that pertain to the same unit seldom occurs in Case III. In
other words, almost all linkages in Case III are false matches in the
sense of the definitions given earlier in this section. In Case III,
definitions of error and reliability which are tractable from a
theoretical perspective are unavailable at this time. In fact,
little theoretical work on the errors present in the synthetic files
of Case III has been done. Until now, the evaluation of a given
matching strategy in Case III has been done from an empirical point
of view. A case in point is the work of Rodgers (1984).

1.8 Summary

In Section 1.3, three important cases for merging two files of
data were distinguished. Of these, Case I and Case II are relevant
when the same individuals are represented in the two files. Case III
arises when only similar individuals are present in the files. This
research is concerned with both theoretical investigations and
empirical evaluations of the quality of synthetic files in Case I and
Case III. We shall not discuss Case II in this thesis.

In Chapter 2, Case I is discussed at some length. A review of
known results for this case is given. New optimality properties of
a maximum likelihood matching strategy are established. Some small
sample and large sample properties of the number of correct matches
with regard to this strategy are derived, shedding some light on the
reliability of the synthetic file arising from using the maximum
likelihood strategy.

Case III is the topic of interest in Chapter 3. The bulk of the
discussion in that chapter is confined to matching two files of data
that are sampled from a trivariate normal population. Thus, if
(X,Y,Z) is a three-dimensional normal random vector, File 1 has data
on (X,Z), while File 2 has data on (Y,Z). Two strategies proposed by
Kadane (1978) and one strategy due to Sims (1978) are used to create
synthetic files out of simulated data on (X,Z) and (Y,Z). These
synthetic files are then evaluated by comparing the estimates of the
correlation between X and Y provided by them with the estimates based
on unbroken data on (X,Y,Z).

II. MERGING FILES OF DATA ON SAME INDIVIDUALS


A useful classification of situations involving statistical
matching of data files was discussed in Section 1.3. It may be recalled
that in the context of the two files having the same individuals, this
classification scheme included two cases. Case I is the scenario
where no matching variables Z are present, while Case II is the
situation where matching variables are part of the statistical model.
In this chapter, we shall discuss results relevant to Case I only.


2.1 A General Model


Let $\binom{\mathbf{T}}{\mathbf{U}}$ be a multi-dimensional random vector with c.d.f. $H(\mathbf{t}, \mathbf{u})$
and p.d.f. $h(\mathbf{t}, \mathbf{u})$. Let $\binom{\mathbf{T}_i}{\mathbf{U}_i}$, $i = 1, 2, \ldots, n$, be a random sample of
size $n$ from $H$. We shall assume that these sample values got broken up
into the component vectors $\mathbf{T}$'s and $\mathbf{U}$'s before the data could be
recorded. Thus we do not know which $\mathbf{T}$ and $\mathbf{U}$ values were paired in the
original sample and the two files consist of the following data:

File 1: $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$,

which is an unknown permutation of $\mathbf{T}_1, \ldots, \mathbf{T}_n$, and

File 2: $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$,

which is an unknown permutation of $\mathbf{U}_1, \ldots, \mathbf{U}_n$.


DeGroot, Feder and Goel (1971) call this a "Broken Random Sample"
model for two files.
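A broken random sample is easy to generate in a simulation, which is useful when studying matching strategies because the true pairing is then known. The Python sketch below is illustrative only: the function name `broken_random_sample`, the use of independent standard normal components, and the fixed seed are assumptions of the example, not part of the model above.

```python
import random


def broken_random_sample(pairs, rng):
    """Split paired observations into two files with the pairing destroyed.

    Each file is returned in an independently shuffled order, so the
    original correspondence between T and U values is not recoverable
    from record positions.
    """
    file1 = [t for t, _ in pairs]
    file2 = [u for _, u in pairs]
    rng.shuffle(file1)
    rng.shuffle(file2)
    return file1, file2


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pairs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5)]
file1, file2 = broken_random_sample(pairs, rng)
```

Each file still contains exactly the original T (respectively U) values; only the information about which values belonged together has been lost.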

Two types of statistical decision and inference problems arise
from observing a broken random sample. The first type of problem
involves trying to pair the x's with the y's in the broken data in
order to reproduce the pairs in the original unbroken sample. The
second type of problem involves making inferences about the values of
parameters in the joint distribution H(t,u) of T and U.

This  chapter  will  be  organized  into  a  review  of  the  literature  on 
matching  problems  in  Sections  2.3  to  2.5,  followed  by  a  discussion  of 
statistical  properties  of  some  matching  strategies  in  Sections  2.6  to 
2.9. 

2.2 Notations

In this section, we introduce most of the notations that will be
used in the present chapter.

(1) $\binom{\mathbf{T}}{\mathbf{U}}$ will denote a multivariate random vector. It is assumed to
have an absolutely continuous joint cumulative distribution
function (CDF) $H(\mathbf{t}, \mathbf{u})$ and joint density $h(\mathbf{t}, \mathbf{u})$; the context will make
the dimensions of $\mathbf{t}$ and $\mathbf{u}$ clear. In particular, $\binom{T}{U}$ will denote
a two-dimensional random vector, with $h(t, u)$ and $H(t, u)$
respectively as the density and CDF of $\binom{T}{U}$. $h_1(\cdot)$ and $h_2(\cdot)$ will
respectively denote the marginal densities of $T$ and $U$, and $F(\cdot)$,
$G(\cdot)$ will be the respective marginal distribution functions.

The symbol $g(\cdot)$ will be the generic notation for the density

(4) Let $\epsilon > 0$. For $i = 1, \ldots, n$, define events $A_i(\varphi, \epsilon)$ as follows:

(2.2.1)

(5) Let $c(x, y)$ be the generic notation for a joint density of two
random variables $T$ and $U$ which are marginally uniform. Then,
define the constant $K$ as $\int_0^1 c(x, x)\,dx$, which is the density of the
random variable $T - U$ evaluated at zero. For any fixed integer $d$,
define


$\mathbf{S}_n = (S_{n1}, \ldots, S_{nd})$, where

(2.2.7)

then we get the representation

1, 2, ... . Let

ed by the vectors
Let $f_{\eta}(\theta)$ be the generic notation for the characteristic function
of a random vector $\eta$, $\theta$ being a vector of dummy variables whose
dimension is the same as that of $\eta$.


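Let $\xi_{jk}(\mathbf{w}_1, \ldots, \mathbf{w}_d)$ be the variable $\xi_{jk}$ when $\mathbf{W}_i$ takes the
value $\mathbf{w}_i$, $i = 1, 2, \ldots, d$.

Let $\boldsymbol{\xi}_k(\mathbf{w}_1, \ldots, \mathbf{w}_d)$ and $\mathbf{S}_n(\mathbf{w}_1, \ldots, \mathbf{w}_d) = \sum_{k} \boldsymbol{\xi}_k(\mathbf{w}_1, \ldots, \mathbf{w}_d)$ be
respectively $\boldsymbol{\xi}_k$ and $\mathbf{S}_n$ when $\mathbf{W}_i = \mathbf{w}_i$, $i = 1, 2, \ldots, d$.

Let $\Psi = \Psi(\mathbf{w}_1, \ldots, \mathbf{w}_d)$ be the negative logarithm of the
modulus of the characteristic function of $\boldsymbol{\xi}_{d+1}(\mathbf{w}_1, \ldots, \mathbf{w}_d)$.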

2.3  Data-Based Matching Strategies

Pairing the observations in the two data files that were described
in Section 2.1 should be distinguished from the problem of matching
two equivalent decks of n distinct cards, which is discussed in
elementary textbooks such as Feller (1968). One version of card
matching is as follows. Consider a "target pack" of n cards laid out
in a row and a "matching pack" of the same number of cards laid out
randomly one by one beside the target pack. In this random arrangement
of cards, n pairs of cards are formed. A match or coincidence is said
to have occurred in a pair if the two cards in the pair are identical.
Because the two decks are merged purely by chance and without using
any type of observations or other information about the cards, one may
describe such problems as no-data matching problems.

An excellent survey of various versions of card matching schemes is
found in Barton (1958).

Suppose that N denotes the number of pairs in the aforementioned
matching problem which have like cards or matches. The derivation of
the probability distribution of N dates back to Montmort (1708). The
following is a summary of some of the well known properties of N
(Feller 1968):

Proposition 2.3.1: If $P_{[m]}$ is the probability of having exactly m
matches, then

(i) $$P_{[m]} = \frac{1}{m!}\left[1 - 1 + \frac{1}{2!} - \frac{1}{3!} + \cdots \pm \frac{1}{(n-m)!}\right], \qquad m = 0, 1, 2, \ldots, n.$$
In particular, $P_{[n]} = 1/n!$.

(ii) Noting that $e^{-1}/m!$ is the probability that a Poisson random
variable with mean 1 takes the value m, we have the following
approximation for large n:
$$P_{[m]} \approx \frac{e^{-1}}{m!}.$$

(iii) For d = 1, 2, ..., n, the dth factorial moment of N, namely
$E(N^{(d)})$, is 1.
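The three parts of Proposition 2.3.1 can be checked numerically for a small deck. The sketch below is only an illustration: the function names `match_pmf` and `factorial_moment` are hypothetical and do not come from the text. It uses exact rational arithmetic, so the factorial-moment identity in (iii) is verified exactly rather than approximately.

```python
from fractions import Fraction
from math import exp, factorial


def match_pmf(n):
    """Exact P[m], m = 0..n, for the number of matches N with n cards (part (i))."""
    pmf = []
    for m in range(n + 1):
        # Alternating sum 1 - 1 + 1/2! - ... +/- 1/(n-m)!
        s = sum(Fraction((-1) ** k, factorial(k)) for k in range(n - m + 1))
        pmf.append(s / factorial(m))
    return pmf


def factorial_moment(pmf, d):
    """E[N(N-1)...(N-d+1)] under the given probability mass function."""
    total = Fraction(0)
    for m, p in enumerate(pmf):
        falling = 1
        for r in range(d):
            falling *= m - r
        total += falling * p
    return total


n = 6
pmf = match_pmf(n)
# Part (iii): every factorial moment of order d <= n equals 1.
moments = [factorial_moment(pmf, d) for d in range(1, n + 1)]
# Part (ii): largest gap between the exact pmf and the Poisson(1) pmf.
poisson_gap = max(abs(float(p) - exp(-1) / factorial(m)) for m, p in enumerate(pmf))
print(moments, poisson_gap)
```

Already for n = 6 the Poisson approximation in (ii) is close, which is why the no-data matching benchmark is usually quoted in its Poisson form.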

As one might expect, for certain broken random sample models, it
pays to match two files of data using optimal strategies based on
such data. Several authors, starting with DeGroot, Feder and Goel
(1971), have proposed and studied matching strategies based on broken
data. In Section 2.9, it will be shown that, for certain matching
strategies based on independent variables T and U, the distributional
properties of the number of correct matches are the same as those
mentioned in Proposition 2.3.1. In other words, as far as statistical
properties of N are concerned, matching files of data on independent
random variables is only as good as no-data matching, in which
we randomly assign units in one file to the units in the other file.


2.4  Repairing a Broken Random Sample

2.4.1  The Basic Matching Problems

Let us consider matching the broken random sample $x_1, x_2, \ldots,
x_n$, $y_1, \ldots, y_n$ by pairing $x_i$ with $y_{\varphi(i)}$, for
i = 1, 2, ..., n, where $\varphi = (\varphi(1), \ldots, \varphi(n))$ is a
permutation of 1, 2, ..., n. As we seek a $\varphi$ from $\Phi$ that will
provide reasonably good pairings of the x's with the y's, we need to
clarify the fundamental role of $\varphi$ in the statistical model
described in Section 2.1. If we treat $\varphi$ as an unknown parameter of
the model, then the likelihood of the data will include $\varphi$. For
instance, if T and U are jointly bivariate normal with means $\mu_1$,
$\mu_2$, variances $\sigma_1^2$, $\sigma_2^2$, and correlation coefficient
$\rho$, then the log-likelihood function of $\varphi$, $\rho$, $\mu_1$,
$\mu_2$, $\sigma_1^2$, $\sigma_2^2$, given the broken random sample, is

$$L(\varphi, \rho, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2 \mid x_1, \ldots, x_n, y_1, \ldots, y_n)
= -\frac{n}{2}\log(1-\rho^2) - \frac{n}{2}\log\sigma_1^2 - \frac{n}{2}\log\sigma_2^2
- \frac{1}{2(1-\rho^2)} \sum_{i=1}^{n} \left[ \frac{(x_i-\mu_1)^2}{\sigma_1^2} + \frac{(y_i-\mu_2)^2}{\sigma_2^2}
- \frac{2\rho\,(x_i-\mu_1)(y_{\varphi(i)}-\mu_2)}{\sigma_1\sigma_2} \right] \tag{2.4.1}$$


A constant term not involving the parameters has been omitted in
(2.4.1). In subsection 2.4.2, we shall seek $\varphi$'s that maximize
likelihoods such as this. On the other hand, some statisticians
would regard $\varphi$ as some sort of missing data and not as a parameter
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of  the  underlying  model.  The  problem  of  pairing  the  two  files  will 
not  arise  in  such  situations.  However,  one  may  still  want  to  do 
statistical  inference  for  other  parameters  of  the  model  based  on  the 
broken  random  sample.  Such  issues  are  not  pursued  in  this  thesis 
and  one  may  refer  to  DeGroot  and  Goel  (1980)  for  an  approach  to 
estimating the correlation coefficient $\rho$ while treating $\varphi$ as
missing data in the bivariate normal model.
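To make the role of the permutation in the bivariate normal log-likelihood concrete, the following sketch evaluates that log-likelihood (up to the omitted constant) for every permutation of a tiny broken sample. The function names and the 0-based tuple encoding of the permutation are assumptions made for illustration only. The point it illustrates is that the permutation enters only through the cross-product term, so for positive correlation the likelihood ranking of permutations coincides with the ranking by that cross term.

```python
import math
from itertools import permutations


def log_likelihood(phi, x, y, mu1, mu2, var1, var2, rho):
    """Bivariate normal log-likelihood of a pairing phi, constant term omitted.

    phi is a 0-based tuple: x[i] is paired with y[phi[i]].
    """
    n = len(x)
    s1, s2 = math.sqrt(var1), math.sqrt(var2)
    quad = 0.0
    for i in range(n):
        dx = x[i] - mu1
        dy = y[phi[i]] - mu2
        quad += dx * dx / var1 + dy * dy / var2 - 2.0 * rho * dx * dy / (s1 * s2)
    return (-0.5 * n * math.log(1.0 - rho ** 2)
            - 0.5 * n * math.log(var1)
            - 0.5 * n * math.log(var2)
            - quad / (2.0 * (1.0 - rho ** 2)))


def cross_term(phi, x, y, mu1, mu2):
    """The only part of the log-likelihood that depends on phi."""
    return sum((x[i] - mu1) * (y[phi[i]] - mu2) for i in range(len(x)))


x = [0.2, -1.0, 1.5]
y = [1.1, 0.1, -0.9]
params = dict(mu1=0.0, mu2=0.0, var1=1.0, var2=1.0, rho=0.6)
perms = list(permutations(range(len(x))))
best_by_likelihood = max(perms, key=lambda p: log_likelihood(p, x, y, **params))
best_by_cross_term = max(perms, key=lambda p: cross_term(p, x, y, 0.0, 0.0))
print(best_by_likelihood, best_by_cross_term)
```

Enumerating all n! permutations is feasible only for very small n; it is used here purely as a check.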

2.4.2  The  Maximum  Likelihood  Solution  to  the  Matching  Problem 

We start with a bivariate model used in DeGroot et al. (1971),
which assumes that the parent probability density function of
$(T, U)'$ is

$$h(t,u) = \alpha(t)\,\beta(u)\,\exp[\gamma(t)\,\delta(u)] \tag{2.4.2}$$

where $\alpha$, $\beta$, $\gamma$, $\delta$ are known but otherwise arbitrary
real-valued functions of the indicated variables. Suppose now that
$x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ are the observations in a broken
random sample from a completely specified density of the form (2.4.2).
If $x_i$ was paired with $y_{\varphi(i)}$ for i = 1, 2, ..., n in the
original unbroken sample, then the joint density of the broken sample
would be

$$\prod_{i=1}^{n} h(x_i, y_{\varphi(i)}) = \left[\prod_{i=1}^{n}\alpha(x_i)\right]\left[\prod_{i=1}^{n}\beta(y_i)\right]\exp\left[\sum_{i=1}^{n}\gamma(x_i)\,\delta(y_{\varphi(i)})\right] \tag{2.4.3}$$

Thus the maximum likelihood estimate of the unknown permutation is
the permutation for which $\sum_{i=1}^{n}\gamma(x_i)\,\delta(y_{\varphi(i)})$
is maximum. Without loss of generality, we shall assume that the
$x_i$'s and $y_j$'s have been reindexed so that
$\gamma(x_1) < \cdots < \gamma(x_n)$ and $\delta(y_1) < \cdots < \delta(y_n)$.
Since $(T, U)'$ is assumed to have an absolutely continuous
distribution, with probability one there are no ties among the
$\gamma(x_i)$'s or the $\delta(y_j)$'s. DeGroot et al. (1971) show that the
maximum likelihood solution is to pair $x_i$ with $y_i$, for
i = 1, ..., n. In other words, the maximum likelihood pairing (M.L.P.)
is $\varphi^* = (1, \ldots, n)$.
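The sorting rule for the M.L.P. is an instance of the rearrangement inequality: a sum of products is largest when both sequences are arranged in the same order. The sketch below compares the sorting rule with an exhaustive search over all permutations. The helper names and the particular choices of the functions passed as `gamma` and `delta` are hypothetical illustrations, not part of the original model specification.

```python
from itertools import permutations


def mlp_by_sorting(x, y, gamma, delta):
    """Pair the k-th smallest gamma(x) with the k-th smallest delta(y).

    Returns a 0-based tuple phi with x[i] paired to y[phi[i]].
    """
    n = len(x)
    x_order = sorted(range(n), key=lambda i: gamma(x[i]))
    y_order = sorted(range(n), key=lambda j: delta(y[j]))
    phi = [None] * n
    for xi, yj in zip(x_order, y_order):
        phi[xi] = yj
    return tuple(phi)


def mlp_brute_force(x, y, gamma, delta):
    """Maximize sum_i gamma(x_i) * delta(y_phi(i)) by enumerating all n! pairings."""
    n = len(x)
    return max(
        permutations(range(n)),
        key=lambda phi: sum(gamma(x[i]) * delta(y[phi[i]]) for i in range(n)),
    )


x = [0.5, -1.2, 2.0, 0.1]
y = [3.0, -0.4, 1.1, 0.2]
gamma = lambda t: t          # illustrative choice
delta = lambda u: u ** 3     # illustrative monotone choice
print(mlp_by_sorting(x, y, gamma, delta), mlp_brute_force(x, y, gamma, delta))
```

The sorting rule costs O(n log n), whereas the exhaustive search is only practical for checking small cases.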

In particular, if the density in (2.4.2) is that of a bivariate
normal random vector with correlation $\rho$, then the M.L.P. can be
described knowing only the sign of $\rho$. If $\rho > 0$, the M.L.P. is to
order the observed values so that $x_1 < \cdots < x_n$ and
$y_1 < \cdots < y_n$ and then to pair $x_i$ with $y_i$, for
i = 1, 2, ..., n. If $\rho < 0$, the solution is to pair $x_i$ and
$y_{n+1-i}$, for i = 1, 2, ..., n. If $\rho = 0$, all pairings, or
permutations, are equally likely.

Chew (1973) derived the maximum likelihood solution to the
(bivariate) matching problem for a larger class of densities $h(t,u)$
with a monotone likelihood ratio. That is, for any values $t_1$, $t_2$,
$u_1$ and $u_2$ such that $t_1 < t_2$ and $u_1 < u_2$,

$$h(t_1,u_1)\,h(t_2,u_2) \ge h(t_1,u_2)\,h(t_2,u_1) \tag{2.4.4}$$

As before, we shall assume that the values $x_1, \ldots, x_n$ and
$y_1, \ldots, y_n$ in a broken random sample are from a density $h(t,u)$
satisfying (2.4.4). Without loss of generality, relabel the x's and
y's so that $x_1 < \cdots < x_n$ and $y_1 < \cdots < y_n$. Then the
permutation $\varphi^* = (1, \ldots, n)$ is again the M.L.P.
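As a hedged illustration of the monotone likelihood ratio condition (2.4.4), the sketch below tests it on random quadruples for the standard bivariate normal density. Working on the log scale avoids underflow; the function names, the sampling range and the tolerance are assumptions chosen for this example only. For this density the log gap equals $\rho (t_2 - t_1)(u_2 - u_1)/(1-\rho^2)$, so the condition holds exactly when $\rho \ge 0$.

```python
import math
import random


def bvn_log_density(t, u, rho):
    """Log-density of the standard bivariate normal with correlation rho."""
    q = (t * t - 2.0 * rho * t * u + u * u) / (1.0 - rho * rho)
    return -math.log(2.0 * math.pi) - 0.5 * math.log(1.0 - rho * rho) - 0.5 * q


def mlr_gap(log_h, t1, t2, u1, u2):
    """log[h(t1,u1) h(t2,u2)] - log[h(t1,u2) h(t2,u1)]; (2.4.4) holds iff >= 0."""
    return (log_h(t1, u1) + log_h(t2, u2)) - (log_h(t1, u2) + log_h(t2, u1))


def check_mlr(log_h, trials, rng, tol=1e-12):
    """Return True if no sampled quadruple with t1 < t2, u1 < u2 violates (2.4.4)."""
    for _ in range(trials):
        t1, t2 = sorted(rng.uniform(-3.0, 3.0) for _ in range(2))
        u1, u2 = sorted(rng.uniform(-3.0, 3.0) for _ in range(2))
        if mlr_gap(log_h, t1, t2, u1, u2) < -tol:
            return False
    return True


rng = random.Random(0)
holds_for_positive_rho = check_mlr(lambda t, u: bvn_log_density(t, u, 0.5), 1000, rng)
gap_for_negative_rho = mlr_gap(lambda t, u: bvn_log_density(t, u, -0.5), 0.0, 1.0, 0.0, 1.0)
print(holds_for_positive_rho, gap_for_negative_rho)
```

A random check of this kind can refute the condition for a candidate density but cannot prove it; it is meant only as a quick sanity test before relying on the sorting rule.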
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2.4.3  Some  Bayesian  Matching  Strategies 

DeGroot et al. (1971) studied the matching problem from a
Bayesian point of view as well. They proposed three optimality
criteria, subject to which one may choose the matching strategy
$\varphi$. Before we state these criteria, we need some notation and
definitions.

Let $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ be the values of a broken
random sample from a given parent distribution with density $h(t,u)$.
If $x_i$ is paired with $y_{\varphi(i)}$, i = 1, 2, ..., n, then the
likelihood function of the unknown permutation $\varphi$ is given by the
equation

$$L(\varphi) = \prod_{i=1}^{n} h(x_i, y_{\varphi(i)}). \tag{2.4.5}$$

Assume that the prior probability of each permutation is $1/n!$. Then
the posterior probability that $\varphi$ provides a completely correct set
of n matches is

$$p(\varphi) = L(\varphi) \Big/ \sum_{\psi \in \Phi} L(\psi). \tag{2.4.6}$$

For j = 1, 2, ..., n, let

$$\Phi(j) = \{\varphi \in \Phi : \varphi(1) = j\} \tag{2.4.7}$$

be the set of $(n-1)!$ permutations which specify that $x_1$ is to be
paired with $y_j$. Using the definitions in (2.4.6) and (2.4.7), we get
the posterior probability that the pairing of $x_1$ and $y_j$ yields a
correct match to be

$$p_j = \sum_{\varphi \in \Phi(j)} p(\varphi), \qquad 1 \le j \le n. \tag{2.4.8}$$


For any two permutations $\varphi$ and $\psi$ in $\Phi$, let

$$K(\varphi, \psi) = \#\{i : \varphi(i) = \psi(i)\}.$$

That is, $K(\varphi, \psi)$ is the number of correct matches when the
observations in the broken random sample are paired according to
$\varphi$ and the vectors in the original sample were actually paired
according to $\psi$. It then follows that for any permutation
$\varphi \in \Phi$, the quantity

$$M(\varphi) = \sum_{\psi \in \Phi} K(\varphi, \psi)\, p(\psi) \tag{2.4.9}$$

is the posterior expected number of correct matches when $\varphi$ is used
to repair the data in the broken random sample.

Finally, let $\Phi_{1,n}$ be the set of all permutations $\varphi$ such that
$\varphi(1) = 1$ and $\varphi(n) = n$.

DeGroot, Feder and Goel (1971) have proposed three optimality
criteria, subject to which one may choose the matching strategy
$\varphi$:

(i) maximize the probability, $p(\varphi)$, of a completely correct set of
n matches,

(ii) maximize the probability, $p_j$, of correctly matching $x_1$ by
choosing an optimal j from {1, 2, ..., n}, and

(iii) maximize the expected number, $M(\varphi)$, of correct matches in the
repaired sample.
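For a very small sample, the three Bayesian quantities can be computed directly by enumerating all n! permutations under the uniform prior. The sketch below does this for a bivariate normal parent; the function names, the 0-based indexing and the choice of density are assumptions made for illustration and are not taken from DeGroot et al. (1971).

```python
import math
from itertools import permutations


def bvn_density(t, u, rho):
    """Standard bivariate normal density with correlation rho."""
    q = (t * t - 2.0 * rho * t * u + u * u) / (1.0 - rho * rho)
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))


def posterior_summary(x, y, h):
    """Return (p, p_j, M) for a broken sample x, y and parent density h.

    p[phi]  : posterior probability that phi is completely correct
    p_j[j]  : posterior probability that x[0] belongs with y[j]
    M[phi]  : posterior expected number of correct matches under phi
    """
    n = len(x)
    perms = list(permutations(range(n)))
    like = {phi: math.prod(h(x[i], y[phi[i]]) for i in range(n)) for phi in perms}
    total = sum(like.values())
    p = {phi: like[phi] / total for phi in perms}
    p_j = [sum(prob for phi, prob in p.items() if phi[0] == j) for j in range(n)]

    def agreements(phi, psi):
        # Number of positions where the two pairings coincide.
        return sum(a == b for a, b in zip(phi, psi))

    M = {phi: sum(agreements(phi, psi) * p[psi] for psi in perms) for phi in perms}
    return p, p_j, M


x = [-1.0, 0.1, 0.9]
y = [-0.8, 0.2, 1.3]
p, p_j, M = posterior_summary(x, y, lambda t, u: bvn_density(t, u, 0.7))
best_complete = max(p, key=p.get)     # criterion (i)
best_j = max(range(len(x)), key=lambda j: p_j[j])   # criterion (ii)
best_expected = max(M, key=M.get)     # criterion (iii)
print(best_complete, best_j, best_expected)
```

Averaging $M(\varphi)$ over all permutations always gives 1, the no-data benchmark, which is a convenient sanity check on any implementation.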

Assuming that the bivariate density of T and U was given by
$h(t,u) = \alpha(t)\,\beta(u)\,e^{tu}$, $(t,u) \in R^2$, the following results,
among others, were established by DeGroot et al. (1971):

(a) The M.L.P. $\varphi^*$ maximizes the probability of correct pairing of
all n observations.

(b) The probability of pairing $x_1$ ($x_n$) correctly is maximized by
pairing $x_1$ ($x_n$) with $y_1$ ($y_n$).

(c) The class of permutations $\Phi_{1,n}$ is complete; that is, given any
permutation $\psi \notin \Phi_{1,n}$, there exists a $\varphi \in \Phi_{1,n}$
which is as good as $\psi$ in the sense that $M(\varphi) \ge M(\psi)$.

(d) Sufficient conditions in terms of the data $x_1, \ldots, x_n$ and
$y_1, \ldots, y_n$ for the M.L.P. $\varphi^*$ to maximize $M(\varphi)$ were also
given.

The results in Chew (1973) and Goel (1975) are extensions of (a)
through (d) to an arbitrary bivariate density $h(t,u)$ possessing the
monotone likelihood ratio. The "completeness" property in (c) implies
that the permutation $\varphi^E$ maximizing $M(\varphi)$ satisfies
$\varphi^E(1) = 1$ and $\varphi^E(n) = n$; hence, for n = 2, 3,
$\varphi^* = \varphi^E$. DeGroot et al. (1971) show by means of a
counterexample that, for n > 3, $\varphi^E$ is not necessarily equal to the
M.L.P. $\varphi^*$.


2.4.4  Matching Problems for Multivariate Normal Distributions
In  our  review  so  far,  we  have  discussed  optimal  matching 



each 

1  e  in 

i  p  *  q ) 

a 

i 

mat.  r  1  x 

xi 

form : 

:% 

(*  t. 

•> 

since the other factors in the joint density of the sample do not
depend on $\varphi$. If we again assume that the prior probability of each
permutation $\varphi$ is $1/n!$, then the posterior probability that
$\varphi$ provides a completely correct set of n matches is given by
(2.4.6). Thus, maximizing $p(\varphi)$ is equivalent to maximizing
$L(\varphi)$, or equivalently minimizing

$$\sum_{i=1}^{n} x_i'\,\Omega_{12}\,y_{\varphi(i)}.$$

There is no simple way, in general, to describe the maximum likelihood
solution.

However, if rank$(\Sigma_{12}) = 1$, then rank$(\Omega_{12}) = 1$ and
$\Omega_{12}$ can be represented in the form $\Omega_{12} = a\,b'$, where
$a$ and $b$ are vectors of dimensions $p \times 1$ and $q \times 1$. If we
let $\gamma(x_i) = a'x_i$ and $\delta(y_i) = b'y_i$, for i = 1, 2, ..., n,
then the M.L.P. will be the permutation that minimizes

$$Q(\varphi) = \sum_{i=1}^{n} \gamma(x_i)\,\delta(y_{\varphi(i)}). \tag{2.4.12}$$

Now, minimizing (2.4.12) is achieved by arranging the $\gamma(x_i)$'s from
smallest to largest, arranging the $\delta(y_j)$'s in the reverse order,
from largest to smallest, and then pairing the corresponding elements
in the two sequences.

Suppose next that rank$(\Omega_{12}) \ge 2$. Without loss of generality,
we shall assume that $p \le q$ and let $v_j = \Omega_{12}\,y_j$, for
j = 1, 2, ..., n. Then both $x_i$ and $v_j$ are p-dimensional vectors,
and the maximum likelihood solution will be the permutation that
minimizes

$$Q(\varphi) = \sum_{i=1}^{n} x_i'\,v_{\varphi(i)}. \tag{2.4.14}$$

Let D denote the $n \times n$ matrix $((d_{ij}))$ whose elements are
$d_{ij} = x_i'\,v_j$. Then minimizing (2.4.14) is equivalent to minimizing

$$Q = \sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}\,a_{ij}$$

subject to the constraints

$$\sum_{i=1}^{n} a_{ij} = 1, \quad \text{for } j = 1, 2, \ldots, n,$$
$$\sum_{j=1}^{n} a_{ij} = 1, \quad \text{for } i = 1, 2, \ldots, n,$$
$$a_{ij} = 0 \text{ or } 1,$$

which is a standard assignment problem with cost matrix D. Although
there is no simple form for the solution of an arbitrary assignment
problem of this type, efficient algorithms are available for finding
numerical solutions.
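The reduction to an assignment problem can be sketched in a few lines. The code below builds the cost matrix D from the vectors $x_i$ and $v_j$ and solves the assignment problem by exhaustive enumeration, which is only a stand-in for the efficient algorithms the text alludes to. The function names are hypothetical. For realistic n one would use a polynomial-time method such as the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`, rather than the O(n!) search shown here.

```python
from itertools import permutations


def cost_matrix(xs, vs):
    """D[i][j] = inner product of the p-vectors xs[i] and vs[j]."""
    return [[sum(a * b for a, b in zip(x, v)) for v in vs] for x in xs]


def solve_assignment_brute_force(D):
    """Minimize sum_i D[i][phi[i]] over all permutations phi (0-based).

    Returns (phi, minimal cost). Only practical for small n.
    """
    n = len(D)
    best = min(
        permutations(range(n)),
        key=lambda phi: sum(D[i][phi[i]] for i in range(n)),
    )
    return best, sum(D[i][best[i]] for i in range(n))


# Small hand-made cost matrix, used only to exercise the solver.
D_example = [[4, 1, 3],
             [2, 0, 5],
             [3, 2, 2]]
phi_hat, q_min = solve_assignment_brute_force(D_example)
print(phi_hat, q_min)
```

The 0/1 variables $a_{ij}$ of the constrained formulation correspond to the permutation by $a_{ij} = 1$ exactly when `phi[i] == j`.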

The permutation $\varphi^E$ that maximizes the expected number of
correct matches is very difficult to calculate when p and n are
moderately large. No efficient algorithms are known. A Monte Carlo
study was reported by DeGroot and Goel (1976) in which they compare
$\varphi^E$ and $\varphi^*$ for p = 2 and 60 different covariance matrices
$\Sigma$ with the sample size n = 3, 4 and 6. In all cases, the proportion
of samples for which $\varphi^E$ and $\varphi^*$ were identical was between
0.926 and 0.996. Thus, it is not unreasonable to use $\varphi^*$ even when
the goal is to maximize the expected number of correct matches.

DeGroot and Goel (1976) studied two other simple matching
strategies which provide good approximations to the M.L.P. $\varphi^*$ or
to the rule $\varphi^E$. We shall not discuss them here. In the rest of
this chapter, we shall discuss matching problems only in the bivariate
case.

2.5  Reliability of Matching Strategies for Bivariate Data

Consider a random sample of size n, $(T_1, U_1)', \ldots, (T_n, U_n)'$,
from a bivariate population with density $h(t,u)$. If the pairings in
this sample are lost before the entire data set is recorded, we can
still observe the marginal order statistics. In fact, if
$x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ is the broken random sample
corresponding to the unobserved sample on $(T, U)'$, then clearly the
order statistics $x_{(1)} < \cdots < x_{(n)}$ of the x's are exactly the
same as the order statistics $T_{(1)} < \cdots < T_{(n)}$ of the T's.
Similarly, the order statistics $y_{(1)} < y_{(2)} < \cdots < y_{(n)}$ are
the same as $U_{(1)} < \cdots < U_{(n)}$. The repairing of the x's and
y's was introduced in Section 2.4. Thus, for each permutation $\varphi$ in
$\Phi$ there is a matching strategy, and the typical merged file consists
of the pairs

$$(T_{(i)}, U_{(\varphi(i))})', \qquad i = 1, 2, \ldots, n. \tag{2.5.1}$$

Some optimal matching strategies were discussed in Section 2.4.
Here, we are concerned with the quality of the file in (2.5.1).

Ideally, we would like to choose a $\varphi$ for which the file in
(2.5.1) recovers all the $(T, U)'$ pairs that we did not observe. It is
therefore natural to look at the random variable $N(\varphi)$, the number
of correct matches due to $\varphi$ or, equivalently, the number of
unobserved sample points which have been recovered in (2.5.1). It
should be pointed out that $M(\varphi)$, which was defined in Section
2.4.3, is different from $E[N(\varphi)]$ because the former quantity is a
posterior expected value given a particular broken random sample,
whereas in the latter the expectation is taken over all possible
samples.
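The distinction between a posterior expectation for one broken sample and the expectation $E[N(\varphi)]$ over all samples can be illustrated by simulation. The sketch below estimates $E[N(\varphi^*)]$ for the sorted pairing by Monte Carlo, drawing repeated samples from a standard bivariate normal parent. The function names, sample size, correlation values and number of trials are assumptions for illustration. With independent components the estimate should be close to 1, the no-data benchmark.

```python
import math
import random


def ranks(values):
    """1-based ranks of the entries of values (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r


def expected_correct_matches(n, rho, trials, rng):
    """Monte Carlo estimate of E[N] when T_(i) is paired with U_(i)."""
    total = 0
    for _ in range(trials):
        t, u = [], []
        for _ in range(n):
            z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            t.append(z0)
            u.append(rho * z0 + math.sqrt(1.0 - rho * rho) * z1)
        rt, ru = ranks(t), ranks(u)
        # Unit k is recovered exactly when T_k and U_k have the same rank.
        total += sum(1 for k in range(n) if rt[k] == ru[k])
    return total / trials


rng = random.Random(12345)
avg_independent = expected_correct_matches(5, 0.0, 4000, rng)
avg_dependent = expected_correct_matches(5, 0.9, 4000, rng)
print(avg_independent, avg_dependent)
```

The estimate is an average over simulated samples, so it approximates $E[N(\varphi^*)]$ and not the posterior quantity $M(\varphi^*)$, which is conditional on one observed broken sample.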

Situations often arise where it is not crucial that, after the
two files are matched, the matched pairs are exactly the same as the
pairs of the original data. For example, when contingency tables are
contemplated for grouped data on continuous variables T and U, we
may, in the absence of the knowledge of the pairings, like to
reconstruct the pairs but would not worry too much as long as the


2.6  An Optimality Property of the Maximum Likelihood Pairing $\varphi^*$

The known results about the optimality of the maximum likelihood
pairing $\varphi^* = (1, \ldots, n)$ with respect to some Bayesian criteria
were reviewed in Section 2.4. Here, we shall propose a new criterion
and establish that $\varphi^*$ is optimal with respect to that criterion.
Consider the random variable $N(\varphi)$, the number of correct
matches which result when a permutation $\varphi$ in $\Phi$ is used to merge
the broken random sample from a bivariate population. In this
section, we shall show that $\varphi^*$ maximizes $E(N(\varphi))$, the expected
number of correct matches, provided that the parent density $h(t,u)$
exhibits certain dependence structures.

We begin by quoting a very useful result on the exchangeability
of random variables from Randles and Wolfe (1979).

Lemma 2.6.1: If $\xi \stackrel{d}{=} \eta$ and $K(\cdot)$ is a measurable
function (possibly vector valued) defined on the common support of
these random vectors, then

$$K(\xi) \stackrel{d}{=} K(\eta).$$

We now establish a representation for $N(\varphi, \varepsilon)$ as a sum of
exchangeable Bernoulli random variables, which will be useful for
extending results of Yahav (1982).

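Theorem 2.6.1: Let $N(\varphi, \varepsilon)$ and $V_{ni}(\varphi, \varepsilon)$ be as
defined by (2.5.2) and (2.2.4) respectively. Then

$$\forall\, \varphi \in \Phi, \qquad N(\varphi, \varepsilon) = \sum_{i=1}^{n} V_{ni}(\varphi, \varepsilon), \tag{2.6.1}$$

where the summands are exchangeable random variables.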

Proof: The order statistic $U_{(\varphi(i))}$ and the concomitant $U_{[i]}$
of $T_{(i)}$ used in (2.5.2) can be written in terms of the ranks of the
T's and U's as follows:

$$U_{(\varphi(i))} = \sum_{\alpha=1}^{n} U_\alpha\, I(R_{2\alpha} = \varphi(i)) \tag{2.6.2}$$

$$U_{[i]} = \sum_{\alpha=1}^{n} U_\alpha\, I(R_{1\alpha} = i) \tag{2.6.3}$$

Note that $N(\varphi, \varepsilon)$ is simply a count of how many pairs in the
merged file due to $\varphi$, namely,

$$(T_{(i)}, U_{(\varphi(i))})', \qquad i = 1, 2, \ldots, n, \tag{2.6.4}$$

satisfy

$$|U_{(\varphi(i))} - U_{[i]}| \le \varepsilon. \tag{2.6.5}$$

If (2.6.5) holds for some i, then there exists a j such that

$$|U_{(\varphi(R_{1j}))} - U_j| \le \varepsilon.$$

In view of the continuity of $(T_j, U_j)$, this correspondence is one to
one. Therefore, the count $N(\varphi, \varepsilon)$ must be the same as the
count given by

$$N(\varphi, \varepsilon) = \sum_{i=1}^{n} I\left(|U_{(\varphi(R_{1i}))} - U_i| \le \varepsilon\right). \tag{2.6.6}$$

Hence, (2.6.1) holds by virtue of the definition (2.2.4) of $V_{ni}$.


Towards showing the exchangeability of the $V_{ni}$'s, note that the
vectors $W_1, \ldots, W_n$ of the original sample in (2.6.6) are
independent and identically distributed. Hence, using the
equal-in-distribution notation, we get

$$(W_{\alpha_1}, \ldots, W_{\alpha_n}) \stackrel{d}{=} (W_1, \ldots, W_n) \tag{2.6.7}$$

where $(\alpha_1, \ldots, \alpha_n)$ is an arbitrary permutation of
(1, 2, ..., n).

Define a function $f = (f_1, \ldots, f_n)$ from $R^{2n}$ to $R^n$ by the
equations


\  1  lf  1-1  '  "V,  1(>j ■«V>0|’  ‘  '(Hj  »,.>  - 


11  J  i 


11  J  1 


if  otherwise 


j  -  1,2,  ....  n , 


(2.6.8) 


where $\varphi$ is the matching strategy we started with and
$(a_1, b_1, \ldots, a_n, b_n)$ is an arbitrary point in $R^{2n}$.

It follows from (2.6.7) and Lemma 2.6.1 that

$$f(W_{\alpha_1}, \ldots, W_{\alpha_n}) \stackrel{d}{=} f(W_1, \ldots, W_n). \tag{2.6.9}$$

Fix j as an integer in {1, 2, ..., n}. Then, using (2.6.8), we see
that $f_j(W_{\alpha_1}, \ldots, W_{\alpha_n})$ is the indicator function of
the event


A  X(«l  ,  U  >f  )  ,,,(  J-  r(T  T  •(>)  *  '  \  1  ( n  H  •  .  )  . 

1  1  a.)  i  11  a.)  1  1  L  a  J  1 


or,  equivalently,  in  terms  of  the  ranks  R 


Because $(\alpha_1, \ldots, \alpha_n)$ is an arbitrary permutation of
1, 2, ..., n, we conclude from (2.6.10) that the summands in (2.6.6)
are exchangeable random variables.

Corollary 2.6.1: The number of correct matches resulting from the
matching strategy $\varphi$ has the representation

$$N(\varphi) = \sum_{i=1}^{n} I\left(R_{2i} = \varphi(R_{1i})\right). \tag{2.6.11}$$

Proof: Set $\varepsilon = 0$ in Theorem 2.6.1. $\square$
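The rank representation of the number of correct matches translates directly into code. In the sketch below, `phi` is a list holding the values of the permutation for ranks 1 to n, so `phi[k - 1]` is the U-rank that the strategy pairs with the unit whose T-rank is k. The function names are hypothetical, and the original pairing is taken to be known, as in a simulation study.

```python
def ranks(values):
    """1-based ranks of the entries of values (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r


def correct_matches(t, u, phi):
    """Count units i whose U-rank equals phi applied to their T-rank.

    t[i] and u[i] are the original (unbroken) pair for unit i;
    phi[k - 1] is the U-rank paired with T-rank k.
    """
    r1, r2 = ranks(t), ranks(u)
    return sum(1 for i in range(len(t)) if r2[i] == phi[r1[i] - 1])


t = [0.3, 1.2, -0.5]
u = [0.1, 0.9, 0.4]
n_identity = correct_matches(t, u, [1, 2, 3])   # the sorted pairing
n_other = correct_matches(t, u, [2, 1, 3])      # another strategy
print(n_identity, n_other)
```

In this toy sample the sorted pairing recovers one unit while the second strategy happens to recover all three, which is a reminder that optimality statements about the sorted pairing concern expectations, not every individual sample.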

We will need the following special dependence structures for
the population density $h(t,u)$ (see Shaked 1979).

Definition 2.6.1: Exchangeable random variables T, U are said to
be positive dependent by mixture (PDM) iff the joint distribution of
T, U is that of $g(\xi_0, \xi_1)$ and $g(\xi_0, \xi_2)$, where $\xi_1$ and
$\xi_2$ are i.i.d. random variables, $\xi_0$ is a random vector which is
independent of $\xi_1$ and $\xi_2$, and g is a Borel measurable function.

Definition 2.6.2: Exchangeable random variables T, U are said to
be positive dependent by expansion (PDE) iff the joint distribution
of T and U admits the following series expansion:

$$dH(t,u) = \left[1 + \sum_{i=1}^{\infty} a_i\,\eta_i(t)\,\eta_i(u)\right] dF(t)\,dF(u) \tag{2.6.12}$$

where $F(\cdot)$ is the marginal CDF of T or U, the $a_i$'s are nonnegative
real numbers, and $\{\eta_i\}$ is a set of functions satisfying

$$\int_{-\infty}^{\infty} \eta_i(x)\,dF(x) = 0, \qquad i = 1, 2, \ldots \tag{2.6.13}$$

According to Definitions 2.6.1 and 2.6.2, the dependence
concepts apply only to pairs of exchangeable random variables.

It may also be noted that for most of the known expansions of PDE
distributions, the set of functions $\{\eta_k(\cdot)\}$ satisfies, in
addition to (2.6.13), the orthogonality conditions

$$\int_{-\infty}^{\infty} \eta_k(x)\,\eta_\ell(x)\,dF(x) = \delta_{k\ell} \tag{2.6.14}$$

where $k, \ell = 1, 2, \ldots$ and $\delta_{k\ell}$ is the Kronecker delta.

We now give two examples to illustrate these concepts of
dependence.

Example 2.6.1: Let $\xi_0$, $\xi_1$, $\xi_2$ be i.i.d. standard normal random
variables. Let $\rho$ be any constant in the interval (0,1). Define new
random variables

$$T = \sqrt{1-\rho}\;\xi_1 + \sqrt{\rho}\;\xi_0$$
$$U = \sqrt{1-\rho}\;\xi_2 + \sqrt{\rho}\;\xi_0$$

Then, it is easy to verify that T, U are jointly normal and that
Definition 2.6.1 can be applied to T and U with the above choice
of $\xi_0$, $\xi_1$ and $\xi_2$. Hence, the standard bivariate normal
distribution with nonnegative correlation has the PDM property.

Also, Mardia (1970, p. 48) gives the following series expansion
for the bivariate normal density:

$$h(t,u) = \left[1 + \sum_{k=1}^{\infty} \rho^k\,\eta_k(t)\,\eta_k(u)\right] f(t)\,f(u), \tag{2.6.15}$$

where $f(t)$ is the density of the univariate standard normal random
variable and $\{\eta_k(\cdot)\}$ is a set of orthonormal Hermite polynomials.
Thus, if $\rho > 0$, bivariate normal distributions possess the PDE
property as well.
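The series expansion of the bivariate normal density in orthonormal Hermite polynomials can be checked numerically. The sketch below assumes the probabilists' Hermite polynomials normalized by $\sqrt{k!}$, generated by the three-term recurrence $\eta_{k+1}(x) = (x\,\eta_k(x) - \sqrt{k}\,\eta_{k-1}(x))/\sqrt{k+1}$, and compares a truncated series with the closed-form density. The function names and the truncation point are choices made for this illustration only.

```python
import math


def orthonormal_hermite(x, kmax):
    """Values eta_0(x), ..., eta_kmax(x) of the orthonormal Hermite polynomials."""
    vals = [1.0, x]
    for k in range(1, kmax):
        vals.append((x * vals[k] - math.sqrt(k) * vals[k - 1]) / math.sqrt(k + 1))
    return vals[: kmax + 1]


def normal_density(t):
    """Univariate standard normal density f(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def series_density(t, u, rho, kmax):
    """Truncated expansion [1 + sum_{k=1}^{kmax} rho^k eta_k(t) eta_k(u)] f(t) f(u)."""
    et, eu = orthonormal_hermite(t, kmax), orthonormal_hermite(u, kmax)
    bracket = sum(rho ** k * et[k] * eu[k] for k in range(kmax + 1))  # k = 0 gives 1
    return bracket * normal_density(t) * normal_density(u)


def bvn_density(t, u, rho):
    """Closed-form standard bivariate normal density with correlation rho."""
    q = (t * t - 2.0 * rho * t * u + u * u) / (1.0 - rho * rho)
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))


approx = series_density(0.3, -0.7, 0.5, 60)
exact = bvn_density(0.3, -0.7, 0.5)
print(approx, exact)
```

Because the coefficients $\rho^k$ are nonnegative exactly when $\rho \ge 0$, the same computation also shows why the PDE property is tied to nonnegative correlation.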


Example 2.6.2: A class of bivariate densities due to Farlie, Gumbel
and Morgenstern is given by the formula

$$h(t,u) = 1 + \alpha\,(1 - 2t)(1 - 2u), \qquad 0 \le t, u \le 1. \tag{2.6.16}$$

It is easy to check that T and U are PDE for $\alpha > 0$ in (2.6.16).
Note that the expansion (2.6.16) has only a finite number of terms,
unlike the expansion for the bivariate normal distribution.
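The check mentioned in Example 2.6.2 can be written out in one line of algebra. The following is a sketch in which the normalization of the single function is a choice made here, not taken from the source: the marginals are uniform on $[0,1]$, so $dF(x) = dx$, and the density is rewritten in the form required by Definition 2.6.2.

```latex
\begin{align*}
\eta_1(x) &= \sqrt{3}\,(1 - 2x), \qquad a_1 = \frac{\alpha}{3}, \\
h(t,u) &= 1 + \alpha (1-2t)(1-2u) = 1 + a_1\,\eta_1(t)\,\eta_1(u), \\
\int_0^1 \eta_1(x)\,dx &= \sqrt{3}\,\bigl[x - x^2\bigr]_0^1 = 0, \\
\int_0^1 \eta_1(x)^2\,dx &= 3\int_0^1 (1-2x)^2\,dx = 3 \cdot \tfrac{1}{3} = 1 .
\end{align*}
```

With this choice the single coefficient $a_1 = \alpha/3$ is nonnegative exactly when $\alpha \ge 0$, and both the zero-mean condition and the normalization of $\eta_1$ hold under the uniform marginal.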




We now prove that the PDM/PDE structures are inherited by a pair
of new variables obtained from a given sample by computing the same
function of the marginals. These results are generalizations of
theorems in Shaked (1979), which were proved only for n = 2. However,
mathematical induction does not help to show the results for an
arbitrary n.


Theorem 2.6.2: Let $(T_i, U_i)'$, i = 1, 2, ..., n, be a random sample
from a PDM parent with density $h(t,u)$. Then, for any measurable
function $g: R^n \to R$, the random variables $g(T_1, \ldots, T_n)$ and
$g(U_1, \ldots, U_n)$ are jointly PDM.


Proof: By hypothesis, the vectors $(T_i, U_i)'$ are i.i.d. Furthermore,
since the PDM property is defined only for exchangeable pairs of
random variables, we have

$$(T_i, U_i) \stackrel{d}{=} (U_i, T_i), \qquad i = 1, 2, \ldots, n. \tag{2.6.17}$$

Equation (2.6.17) together with the independence of the (T, U) pairs
yields

$$(T_1, \ldots, T_n, U_1, \ldots, U_n) \stackrel{d}{=} (U_1, \ldots, U_n, T_1, \ldots, T_n). \tag{2.6.18}$$

Consider the function $K: R^{2n} \to R^2$ defined by the equation

$$K(a_1, \ldots, a_n, b_1, \ldots, b_n) = \bigl(g(a_1, \ldots, a_n),\; g(b_1, \ldots, b_n)\bigr)$$

where $(a_1, \ldots, a_n, b_1, \ldots, b_n)$ is any point in $R^{2n}$. Applying
the function K to both sides of (2.6.18) and invoking Lemma 2.6.1, we
get

$$(g(T), g(U)) \stackrel{d}{=} (g(U), g(T)). \tag{2.6.19}$$

Hence, $(g(T), g(U))$ is an exchangeable pair of random variables.

The PDM property of $(T_i, U_i)$, i = 1, 2, ..., n, further implies
that there exist n i.i.d. vectors $(\xi_{0i}, \xi_{1i}, \xi_{2i})$,
i = 1, 2, ..., n, and a measurable function f such that

(i) for each j, $\xi_{1j}$ and $\xi_{2j}$ are i.i.d. univariate random
variables and the vector $\xi_{0j}$ is independent of $\xi_{1j}$ and
$\xi_{2j}$;

(ii) for each j,

$$T_j = f(\xi_{0j}, \xi_{1j}) \quad \text{and} \quad U_j = f(\xi_{0j}, \xi_{2j}). \tag{2.6.20}$$


Introducing  the  random  variables, 


l  .  %m 

li  2 


’  ?  i 


and 


a0 


(  ^ 1 2 ’  ‘ ' ^ In ’  ^22  ’ 


'  ' ' ’  ^2n ’  ^01’  ' ‘ ’  ^On’ 


On 


(2.6.21) 


We find that $\xi_1^*$ and $\xi_2^*$ are i.i.d. univariate random variables
and $\xi_0^*$ is independent of $\xi_1^*$ and $\xi_2^*$ in view of the
assumptions (i) and (ii). Note that (2.6.20) and (2.6.21) imply that

$$g(T) = g\bigl(f(\xi_{01}, \xi_{11}), \ldots, f(\xi_{0n}, \xi_{1n})\bigr)$$

is a measurable function $g^*$, say, of $\xi_0^*$ and $\xi_1^*$. Similarly,
$g(U)$ is also the same function $g^*$ of the random variables $\xi_0^*$
and $\xi_2^*$. Hence, by definition, $g(T)$ and $g(U)$ are PDM. $\square$

The next theorem is similar to Theorem 2.6.2 except that the parent
distribution has the PDE property.

Theorem 2.6.3: Let $(T_i, U_i)'$, i = 1, ..., n, be a random sample from
a PDE parent. Then, for any measurable function $g: R^n \to R$, the
random variables $g(T_1, \ldots, T_n)$ and $g(U_1, \ldots, U_n)$ are PDE.

Proof: The exchangeability of the joint distribution of $g(T)$ and
$g(U)$ has already been proved in Theorem 2.6.2 (see equation 2.6.19).
It remains to be shown that, when the joint density of each of the n
copies of T, U admits an expansion of the type (2.6.12), the joint
density of $g(T)$ and $g(U)$ also admits a similar expansion.

Assume therefore that there exist nonnegative constants $\{a_k\}$
and a set of orthonormal functions $\{\eta_k(\cdot)\}$ such that the joint
density of $T_i$ and $U_i$ is of the form

$$dH(t_i, u_i) = dF(t_i)\,dF(u_i)\left[1 + \sum_{k=1}^{\infty} a_k\,\eta_k(t_i)\,\eta_k(u_i)\right], \tag{2.6.22}$$

where i = 1, 2, ..., n. For any real x, define the measurable set in
$R^n$

$$A(x) = \{(x_1, \ldots, x_n) : g(x_1, \ldots, x_n) \le x\}.$$

Then, the distribution function Q, say, of $(g(T), g(U))$ is

$$Q(x, y) = \int \cdots \int_{t \in A(x)} \int \cdots \int_{u \in A(y)} \prod_{j=1}^{n} dH(t_j, u_j). \tag{2.6.23}$$


Using the expansions in equation (2.6.22) we get


Q(x.y)  -  Q ( x ) Q ( y )  ♦ 


„  v  (  1 )  ,  .  ( 1  )  ,  , 

"  \xk  (x>xk  ,sl 


CX>  CX> 

<">  k\  t\  vsvi  <x)s>> 


X  \ 
k  =i  S 

n 


( n ) 

ak  Xk 

n  S’ 


,  .  ( n ) 

k  ,x,xk,.. 
n  1 


(2.6.24) 


where  Q(  x )  .  J  ...  |  n  (i F ( t  ) 

A  (  x )  14  1 


x„  (X)  :  !  .  .  I  n  (t  )  n  dF(t. ) 

A  ( x )  K  1  i  1  1 


xk  ^(x)  J  ...  I  nk(t]  )n4(t?)  11  dFU^ 

A  (  x )  'll  * 


x'  .  (X)  J  .  .  .  J  n  n.  (t.  )  n  dF  ( t  ) 

l  ’  ’  '  ’  n  A ( x )  1-1  i  1  1-1 


(  2 . 6 . 2  6  ) 


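Note that $\forall\, k_1, \ldots, k_i = 1, 2, \ldots$ and $\forall\, i = 1, 2,
\ldots, n$ the signed measure induced by $\chi^{(i)}_{k_1 \ldots k_i}(x)$ is
absolutely continuous with respect to Q, so that there exists
$\psi^{(i)}_{k_1 \ldots k_i}(x)$, the Radon-Nikodym derivative of
$\chi^{(i)}_{k_1 \ldots k_i}(x)$ with respect to Q, such that

$$\chi^{(i)}_{k_1 \ldots k_i}(x) = \int_{-\infty}^{x} \psi^{(i)}_{k_1 \ldots k_i}(t)\,dQ(t). \tag{2.6.26}$$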


Hence,  from  equations  (2.6.24)  to  (2.6.26)  we  get 


dQ(x.y)  dQ(  x  )dQ(  y  )  [  1  ♦  n  £  a  4/  1  1  (  x  )  1  >  ( y ) 

k=l  *  K  k 


O  *  ^  ak  ak  ^k^k  (x)u,k?)k  (y) 

k  =1  k2=l  1  K2  1’  2  V  2 


r  r«  (n)  .  .  (  fl  )  .  . 

*  l  -l  a  . . .  a  «*.  <x)*k  ,,  (y > 

k  =1  k  =1  1  n  1 .  n  1 .  n 

1  n 

(2.6.27) 

Representation (2.6.27) holds almost everywhere (Q measure) because
Radon-Nikodym derivatives are defined up to sets of measure zero.
Also, the coefficients in (2.6.27), being products of the nonnegative
$a_k$'s, are themselves nonnegative. Hence, to complete the proof we
only have to show that the orthogonality conditions (2.6.14) hold for
the $\psi$'s of the expansion in (2.6.27).

For  l  1,2,  ....  n ,  and  1  <  k^ ,  ....  k  <•.  <»  , 
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marginal  d  )  st r 1 but  ion  F(»)  of  T  so  that 
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where  S.  1 ,  2  ,  .  .  .  n 

and  this  completes  the  proof. 

The following facts about bivariate ranks are easy consequences
of Theorems 2.6.2 and 2.6.3.

Corollary 2.6.1: Let (T_i, U_i), i = 1,2,...,n, be a random sample
from a PDM (PDE) distribution, and let R_1i and R_2i denote the ranks
of T_i and U_i respectively, where i = 1,2,...,n. The pair
(R_1i, R_2i) is PDM (PDE), i = 1,2,...,n.


Proof: Fix i and define a function g: R^n → R by the equation
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and  observe  that 
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By  invoking  Theorems  2.6.2  and  2.6.3,  the  result  follows.  u 

We need one more result before we establish an optimality property
of φ*.

Theorem 2.6.4: Let random vectors (T_i, U_i), i = 1,2,...,n, be
PDM/PDE and denote the ranks of T_i, U_i among the T's and the U's by
R_1i, R_2i respectively. Consider the joint probability mass function
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>f  t  and  R?1-  Then,  it  *  s  satisfy  the  following  inequalities: 
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Proof:  By  hypothesis,  the  parent  distribution  is  PDM  or  PDE.  Accor 


ding  to  Corollary  2.6.1,  R  and  R  are  also  PDM  ur  PDE.  Cotise 
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ptently,  R  and  R()  are  exchangeable  random  variables.  Hence, 
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To  establish  (2.6.10),  first  consider  the  ease  when  T  and  II  are  [M)M 

By Theorem 2.5.2, R_11 and R_21 are PDM. Hence, there exists a
distribution function Q(·), say, such that
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where  tt  ^ .  ( t  )  and  ir.j(t)  are  the  conditional  mass  functions  of  R^j 
and  R  ,  ^iven  a  value  t  1 rom  the  Q  distribution. 

It  follows  from  equation  (2.6.32)  that 
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We  thus  obtain  (2. 6. JO)  when  T,U  are  PDM .  Suppose  now  that  T  and  U 


are PDE. Then, by virtue of Corollary 2.6.1, R_11 and R_21 would be
PDE. R_11 and R_21 are ranks that are based on independent random
variables; hence, R_11 and R_21 are both discrete uniform random
variables on 1,2,...,n (see Randles and Wolfe (1979), p. 18).

As R_11 and R_21 have finite supports, the series expansion of the
joint mass function of R_11 and R_21 will have a finite number of
terms. In fact, Fisher's
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identity  (see  Lancaster  (1969),  p.  90)  holds: 
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1  <  i  .  j  <  n 


(2.6.33)

where {a_k} are nonnegative constants and {η_k(·)} are orthogonal
functions on 1,2,...,n. The representation (2.6.33) leads to the
following reasoning:

For 1 ≤ i, j ≤ n,
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( 2 .6  .  in  ) 


Hence, we obtain the inequalities in (2.6.30). An optimality
property of φ* can now be established:

Theorem 2.6.6: Let (T_i, U_i), i = 1,2,...,n be as in Theorem 2.6.4.
Then, ∀ φ ∈ Φ,
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IToof  :  In  Corollary  2  .  6  .  L  ,  N(y>)  was  written  as  a  sum  of  exchang'-ab  1  < 
indicator random variables. Hence, using equation (2.6.11), we get

    E(N(\varphi)) = n P(R_{21} = \varphi(R_{11}))                      (2.6.36)
                  = n \sum_{k=1}^{n} P(R_{21} = \varphi(k), R_{11} = k)
                  = n \sum_{k=1}^{n} \pi_{k,\varphi(k)},

where π is the joint mass function of R_11, R_21. Invoking the
inequalities on π in (2.6.30) we obtain

    E(N(\varphi)) \le n \sum_{k=1}^{n} (\pi_{kk} + \pi_{\varphi(k)\varphi(k)})/2
                  = n [ \tfrac{1}{2} \sum_{k=1}^{n} \pi_{kk}
                        + \tfrac{1}{2} \sum_{k=1}^{n} \pi_{\varphi(k)\varphi(k)} ]
                  = n \sum_{k=1}^{n} \pi_{kk}
                  = E(N(\varphi^*)),

where the last two steps use the facts that φ is a permutation of
1,2,...,n and that φ* is the identity. This establishes the desired
result.

To interpret Theorem 2.6.6, we first recall from subsection 2.4.2
that φ* = (1,2,...,n) is the M.L.P if the parent density has the
monotone likelihood ratio (MLR) property. As demonstrated by Shaked
(1979), there is no general relationship between the PDM/PDE concepts
of positive dependence and the MLR property. We can therefore state
the optimality of φ* in Theorem 2.6.6 as below:

Let T, U have a joint density that has the MLR property. In addition,
let T and U be either PDM or PDE random variables. Let x_1, ..., x_n,
y_1, ..., y_n be a broken random sample from the (T, U) population.
Then the M.L.P φ* is an optimal strategy to match the x's with the
y's in the sense of maximizing the expected number of correct matches.
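
The optimality statement can be illustrated by simulation. The sketch
below is not part of the original study: it assumes a bivariate normal
parent with unit variances and correlation 0.8, and the helper names
(ranks, n_correct, mean_correct) are hypothetical. It estimates
E(N(φ)) for the rank pairing φ* and for two other fixed pairings.

```python
import math
import random

def ranks(values):
    """1-based ranks of the entries of `values` (continuous data, so no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def n_correct(t, u, phi):
    """N(phi) = #{i : R_2i = phi(R_1i)}; phi[k-1] is the image of rank k."""
    r1, r2 = ranks(t), ranks(u)
    return sum(r2[i] == phi[r1[i] - 1] for i in range(len(t)))

def bivariate_normal(n, rho, rng):
    """n i.i.d. pairs with standard normal marginals and correlation rho
    (an assumed parent distribution, chosen only for illustration)."""
    t = [rng.gauss(0.0, 1.0) for _ in range(n)]
    u = [rho * ti + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0) for ti in t]
    return t, u

def mean_correct(phi, n, rho, reps, seed=0):
    """Monte Carlo estimate of E(N(phi)); the same seed reuses the same samples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        t, u = bivariate_normal(n, rho, rng)
        total += n_correct(t, u, phi)
    return total / reps

if __name__ == "__main__":
    n = 8
    phi_star = list(range(1, n + 1))         # identity: pair equal ranks
    phi_shift = phi_star[1:] + phi_star[:1]  # rank k paired with rank k+1 (cyclic)
    phi_rev = phi_star[::-1]                 # rank k paired with rank n+1-k
    for name, phi in [("phi*", phi_star), ("shift", phi_shift), ("reverse", phi_rev)]:
        print(name, mean_correct(phi, n, 0.8, 4000))
```

Because every pairing is evaluated on the same simulated samples, the
comparison isolates the effect of the pairing itself.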

2.7 Monotonicity of E(N(φ*))
with Respect to Dependence Parameters

Repairing of broken random samples based on the available data
in two files was discussed in Section 2.4. It was observed that
data-based optimal matching strategies exist when the data come from
populations having certain types of positive dependence structures.
It is therefore reasonable to expect an optimal matching strategy to
perform better when there is some kind of positive dependence in the
population than when the data in the two files are stochastically
independent. Our objective in this section is to present a precise
account of such intuitive results with regard to the maximum
likelihood pairing φ*. To this end, we will draw upon the results of
Section 2.6. We begin with a definition from Shaked (1979):

Definition 2.7.1: Let J be a subset of R. A kernel K defined on J×J
is said to be conditionally positive definite (c.p.d) on J×J iff

(i) K(x,y) = K(y,x), ∀ x,y ∈ J; that is, K is a symmetric kernel.

(ii) Let m be any positive integer. For arbitrary real numbers
a_1, ..., a_m and for every choice of distinct numbers x_1, ..., x_m
from J, it holds that

    \sum_{i=1}^{m} \sum_{j=1}^{m} K(x_i, x_j) a_i a_j \ge 0
    whenever \sum_{i=1}^{m} a_i = 0.                                  (2.7.1)


It is pertinent to note that this definition is related to the
well known concept of a positive definite kernel, which is used in,
among others, the theory of characteristic functions. The
nonnegativity of the quadratic form
\sum_{i=1}^{m} \sum_{j=1}^{m} K(x_i, x_j) a_i a_j, without requiring
the condition \sum_{i=1}^{m} a_i = 0 in (2.7.1), is a standard way of
defining positive definite kernels (Widder, 1941, p. 271). We shall
now give an example of a c.p.d kernel which will be used in the
sequel.

Example 2.7.1: Let J = {1,2,...,n}, where n is a fixed positive
integer. To verify that the kernel K(x,y) = I(x=y) is conditionally
positive definite on J×J, let m be a positive integer. For arbitrary
real numbers a_1, ..., a_m and for every choice of distinct integers
i_1, ..., i_m from J, we have

    \sum_{\alpha=1}^{m} \sum_{\beta=1}^{m} I(i_\alpha = i_\beta) a_\alpha a_\beta
        = \sum_{\alpha=1}^{m} a_\alpha^2 \ge 0,                         (2.7.2)

where we have used the fact that, in view of the integers
i_1, ..., i_m being distinct, i_α = i_β iff α = β.

Note that we did not have to impose the condition
\sum_{i=1}^{m} a_i = 0 to arrive at (2.7.2). Also, the function
I(x=y) is clearly symmetric in x and y. Hence, it follows from
(2.7.2) that K(x,y) is positive definite and, consequently, is also
c.p.d.
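
A small numerical check of the example may be helpful. The sketch
below evaluates the quadratic form of Definition 2.7.1 directly. The
second kernel, K(x,y) = -(x-y)^2, is not taken from the text; it is
added only as an assumed illustration of a kernel that is c.p.d
without being positive definite.

```python
def indicator_kernel(x, y):
    """K(x, y) = I(x = y), the kernel of Example 2.7.1."""
    return 1.0 if x == y else 0.0

def neg_squared_distance(x, y):
    """K(x, y) = -(x - y)^2: an illustrative kernel, not from the source text."""
    return -float((x - y) ** 2)

def quadratic_form(kernel, points, coeffs):
    """Double sum of K(x_i, x_j) a_i a_j over all pairs (i, j)."""
    return sum(kernel(xi, xj) * ai * aj
               for xi, ai in zip(points, coeffs)
               for xj, aj in zip(points, coeffs))

if __name__ == "__main__":
    pts, a = [1, 2, 3], [1.0, -2.0, 0.5]
    # Indicator kernel: the form collapses to the sum of squares of the a_i.
    print(quadratic_form(indicator_kernel, pts, a), sum(x * x for x in a))
    # -(x-y)^2 is negative for a = (1, 1) but nonnegative once sum(a) = 0.
    print(quadratic_form(neg_squared_distance, [0, 1], [1.0, 1.0]))
    print(quadratic_form(neg_squared_distance, [0, 1], [1.0, -1.0]))
```

For the indicator kernel the form is nonnegative for every coefficient
vector, which is the stronger (positive definite) property used in the
example.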

We will need the following lemma.

Lemma 2.7.1 (Shaked, 1979): Let T and U be PDM or PDE random
variables with joint distribution function H(t,u). Letting F(·) stand
for the common marginal distribution of T and U, define
H_0(t,u) = F(t)·F(u), the distribution function of T and U in the
case of independence of the variables. Then we have the ordering

    E_H(K(T,U)) \ge E_{H_0}(K(T,U))                                   (2.7.3)

iff K(·,·) is a c.p.d kernel, provided the expectations exist.
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Proof:  It  follows  from  the  general  representation  of  N(4>)  in 

equation (2.6.11) that

    E_H(N) = n P_H(R_{11} = R_{21}) = n E_H(K(R_{11}, R_{21})),         (2.7.5)

where K(x,y) = I(x=y). Now, recall from Example 2.7.1 that K(x,y) is
c.p.d on the domain J×J, where J = {1,2,...,n} is the common support
of R_11 and R_21. It was established in Theorems 2.6.2 and 2.6.3 that
R_11 and R_21 are PDM (PDE) according as T and U are PDM (PDE).
Invoking Lemma 2.7.1, we therefore obtain

    E_H(K(R_{11}, R_{21})) \ge E_{H_0}(K(R_{11}, R_{21})).              (2.7.6)

Under H_0, R_11 and R_21 are independent. Also, these ranks are
marginally discrete uniform random variables on 1,2,...,n. Hence,
we get

    E_{H_0}(K(R_{11}, R_{21})) = P_{H_0}(R_{11} = R_{21})
        = \sum_{k=1}^{n} P(R_{11} = k) P(R_{21} = k)
        = \sum_{k=1}^{n} 1/n^2
        = 1/n.                                                        (2.7.7)

Equations (2.7.5) to (2.7.7) imply the desired inequality:
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We  conclude  from  (2.7.4)  that  <p"  provides,  on  the  average,  more 


correct matches when the data in the two files come from certain
positively dependent populations than when they are independent. In
particular, this fact holds for the bivariate normal distribution
with positive correlation as well as for Morgenstern distributions
in Equation (2.6.14), where the dependence parameter α ≥ 0. In the
light of Theorem 2.7.1, it is natural to conjecture that E_H(N), as a
functional of the distribution function H, is order preserving with
regard to certain partial orderings of the space of all continuous
bivariate distributions which have fixed marginals (those of T and U)
and exhibit positive dependence. Although no proof of this
conjecture is available at this time, we offer further evidence in
support of this conjecture in the next two theorems.
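
The baseline value E(N) = 1 under independence, and the gain under
positive dependence, can be checked numerically. In the sketch below
(an illustration only, with hypothetical function names) the
independent case is computed exactly by enumerating permutations,
since the U-ranks are then a uniformly random permutation of the
T-ranks. The dependent case is estimated by Monte Carlo for an assumed
bivariate normal parent with unit variances.

```python
import itertools
import math
import random

def exact_mean_matches_independent(n):
    """Mean number of fixed points over all n! permutations (equals E(N) under H_0)."""
    total = count = 0
    for p in itertools.permutations(range(n)):
        total += sum(p[i] == i for i in range(n))
        count += 1
    return total / count

def ranks(values):
    """1-based ranks (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def simulated_mean_matches(n, rho, reps, seed=1):
    """Monte Carlo estimate of E(N) = E #{i : R_1i = R_2i} for correlation rho."""
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        t = [rng.gauss(0.0, 1.0) for _ in range(n)]
        u = [rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0) for x in t]
        total += sum(a == b for a, b in zip(ranks(t), ranks(u)))
    return total / reps

if __name__ == "__main__":
    print("independent, exact:", exact_mean_matches_independent(6))
    for rho in (0.0, 0.3, 0.6, 0.9):
        print("rho =", rho, "estimate:", simulated_mean_matches(10, rho, 4000))
```

The estimates for increasing values of the correlation give an
informal picture of the order-preserving behavior conjectured above.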

Theorem 2.7.2: Suppose that a broken random sample comes from the
family of densities given by the equation

    h(t,u) = 1 + \alpha (1 - 2t)(1 - 2u),  0 < t, u < 1 and 0 \le \alpha \le 1.   (2.7.8)

Then, E_α(N) is monotone increasing in α.

Proof: Note that in (2.7.8), α = 0 means T and U are independent
and we might say that the farther α is from 0 the more the positive
dependence between T and U. For this family, the marginal
distributions of T and U are uniform on (0,1).

It follows from equation (2.6.27) and Corollary 2.6.1 that the
joint probability function of the ranks R_11 and R_21 can be
canonically expanded as in (2.7.9), where i, j = 1,2,...,n and
{η_k(·)} is a set of functions satisfying the orthogonality
conditions in (2.6.13). Using the expression (2.7.9) for π we get the
representation (2.7.10) of E_α(N), where, after a change of the order
of summations on i and k, we have used nonnegative constants b_k
given by the equation

    b_k = \sum_{i=1}^{n} (\eta_k(i))^2,  k = 1,2,...,n.

It follows from (2.7.10) that E_α(N) is a polynomial in α with
nonnegative coefficients and hence it increases with α, as α goes
from 0 to 1.
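
The monotonicity in α can also be examined by simulation. The sketch
below is an illustration under stated assumptions, not the author's
computation: pairs are drawn from the density (2.7.8) by inverting the
conditional distribution function of U given T = t, which for this
family is u + a u (1 - u) with a = α(1 - 2t). The function names are
hypothetical.

```python
import math
import random

def conditional_inverse(a, w):
    """Solve u + a*u*(1 - u) = w for u in [0, 1], where -1 < a <= 1."""
    return 2.0 * w / (1.0 + a + math.sqrt((1.0 + a) ** 2 - 4.0 * a * w))

def sample_fgm(n, alpha, rng):
    """n i.i.d. pairs from h(t, u) = 1 + alpha*(1 - 2t)*(1 - 2u) on the unit square."""
    pairs = []
    for _ in range(n):
        t = rng.random()                  # T is marginally uniform on (0, 1)
        a = alpha * (1.0 - 2.0 * t)
        pairs.append((t, conditional_inverse(a, rng.random())))
    return pairs

def ranks(values):
    """1-based ranks (no ties assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def expected_matches(alpha, n, reps, seed=2):
    """Monte Carlo estimate of E_alpha(N) for the rank pairing."""
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        pairs = sample_fgm(n, alpha, rng)
        r1 = ranks([p[0] for p in pairs])
        r2 = ranks([p[1] for p in pairs])
        total += sum(x == y for x, y in zip(r1, r2))
    return total / reps

if __name__ == "__main__":
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(alpha, expected_matches(alpha, 10, 4000))
```

Since the dependence in this family is weak (the correlation is at
most 1/3), the increase in the estimates is modest and a fairly large
number of replications is needed to see it clearly.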

Theorem 2.7.3: Suppose that a broken random sample comes from the
bivariate  normal  distributions  given  by  (2.6.16),  where  we  assume 


co  n 

*  [n  .  n-P  l  1  (*<1,<U)2 

n  k  =  1  11  * 
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(2.7.1?) 


where the order of summations over i and k_1, ..., k_n has been
reversed because the terms in the expansion (2.7.11) are all
nonnegative. We conclude from (2.7.12) that E_ρ(N) is a polynomial in
ρ with nonnegative coefficients and hence it increases with ρ as ρ
goes from 0 to 1. □

As we close this section, we shall state a result due to Chew
(1973) which somewhat resembles, though it is conceptually different
from, the inequality E_H(N) ≥ 1 in (2.7.4). Recall the notation M(φ)
in (2.4.9), which denotes the posterior expected number of correct
matches due to the strategy φ. Arguing that M(φ) = 1 when φ is
randomly chosen from Φ, he proved the following result:

Theorem 2.7.4 (Chew, 1973): Let x_1, ..., x_n and y_1, ..., y_n be a
broken random sample from a bivariate distribution possessing
monotone likelihood ratio. If x_1 < ... < x_n and y_1 < ... < y_n,
then the posterior expected number of correct pairings using the
M.L.P φ* is at least unity, that is

    M(\varphi^*) \ge 1.                                               (2.7.13)

It should be noted that the inequality (2.7.13) was derived from a
Bayesian perspective, whereas in our inequality (2.7.4) the
expectation is over all possible samples. Finally, note that while
our comparison is between dependent and independent populations for
the M.L.P, Chew's inequality compares the M.L.P with random pairing.


2.8 Some Properties of N(φ*, ε)

The maximum likelihood pairing, φ*, was introduced in subsection
2.4.2 and some of its small sample properties were studied in
Section 2.7. Specifically, the behavior of E(N(φ*)) was discussed
while holding the sample size n constant and changing only the degree
of dependence in the population. We shall now fix the parameters
describing dependence in the population of (T, U) and allow n to tend
to infinity in order to study the behavior of N(φ*, ε). Later in this
section, we shall present the results of a Monte Carlo study about
N(φ*, ε) in which we vary the dependence parameters even as n takes
different values.

In this section, the notations of Section 2.2 will be used freely.
Recall that N(φ*) and N(φ*, ε) have the shorter notations N and N(ε)
respectively. We start with a review of Yahav's (1982) results
concerning E(N(ε)).




Assuming that the distribution of T and U is such that the
conditional distribution of U given that T = t is (univariate) normal
with mean t and variance 1, Yahav (1982) derived the limiting value
of μ_n(ε) = E(N(ε)/n), as n → ∞, by using the representation (2.b.2)
in which the summands are functions of the order statistics of
U_1, ..., U_n and the concomitants of the order statistics of
T_1, ..., T_n. His proof relied on an approximation theorem
(Bickel and Yahav, 1977) about the order statistics for the above
model. Furthermore, he reported the findings of a Monte Carlo study
for a particular case of his model, namely, T and U are bivariate
normal with correlation ρ.

First, we discuss the large sample behavior of N(ε)/n in the case of
samples from an arbitrary population. The properties of its expected
value are available as a consequence. Second, we indicate how Yahav's
simulation study of the small sample properties of μ_n(ε) can be
improved upon. We shall then present the results of our own Monte
Carlo study of μ_n(ε) when n is small.


Theorem 2.8.1: For broken random samples from an absolutely
continuous distribution,

    N(\varepsilon)/n \xrightarrow{Pr} \mu(\varepsilon), as n \to \infty,          (2.8.2)

where μ(ε) = P(F(T - ε) ≤ G(U) ≤ F(T + ε)).

Proof: Let L_n = N(ε)/n. Recall the representation (2.6.b) for N(ε)
as a sum of exchangeable indicators:

    N(\varepsilon) = \sum_{i=1}^{n} I_{A_{ni}(\varepsilon)}.                      (2.8.3)

It follows that

    E(L_n) = n P(A_{n1}(\varepsilon))/n = P(A_{n1}(\varepsilon)).                 (2.8.4)

Note that

    E(L_n^2) = n^{-2} [ E(N(\varepsilon)^{(2)}) + E(N(\varepsilon)) ],            (2.8.5)

where E(N(ε)^{(2)}) is the second factorial moment of N(ε). Using the
exchangeable representation (2.8.3) again, we get

    E(L_n^2) = n^{-2} [ n^{(2)} P(A_{n1}(\varepsilon) A_{n2}(\varepsilon))
                        + n P(A_{n1}(\varepsilon)) ].


'"t  ■’!«  -  Ual 


^  •»  l »  2  *  •••.  n, 

2a  U1  2al 


(2.8.6) 


where  the  sequences  {5;.^^}  smd  {?.,  are  defined  in  (2.2.12) 


Using  (2.8.6),  we  get 


A  .(f)  (n.,/n  <  0,  n  ,/n  <  0) 

n 1  11  21 


(28./) 


2  2 


A  At)  A  (t  ) 
nl  n2 


O  O  ( n  /n  <  0) 

i -1  J  1  J 


(2.8.8) 


"i  •  r. 

n  /n  *  f!(ll  )  F  ( T  *  c  )  (2.8.1?) 

.'<1  U  <1 

Wr, IT!*  r,  I  ,  2  . 

Using the fact (see Serfling, 1980, p. 52) that a sequence of
vectors converges almost surely to a given vector iff the
componentwise sequences converge almost surely to the appropriate
components of the limit, we get from (2.8.11) and (2.8.12)


nu/n 


n?1/n 


nl/n 


/  KtTl  f 

)  t;  ( u 

1 

i  G  (  U  ^  ) 

F(T  ♦£ 

F(T  r 

>  a  ( u 

1  ? 

V 

\  n(U2> 

F(T2«r. 

(2.8.13) 

and  the 

(  2 . 8  .  H  ) 


It follows from (2.8.7), (2.8.8), (2.8.13) and the independence of
(T_1, U_1) and (T_2, U_2) that

    P(A_{n1}(\varepsilon)) \to \mu(\varepsilon),                                 (2.8.14)

    P(A_{n1}(\varepsilon) A_{n2}(\varepsilon)) \to \mu^2(\varepsilon).            (2.8.15)

Using (2.8.4), (2.8.5), (2.8.14), (2.8.15) it is easy to verify that,
as n → ∞,

    E(L_n) \to \mu(\varepsilon),
    var(L_n) \to 0.                                                   (2.8.16)

It is well known that (2.8.16) implies the convergence in probability
in (2.8.2). □

The following corollary generalizes Yahav's (1982) result concerning
μ_n(ε), the first moment of N(ε)/n.

Corollary 2.8.1: For p > 0,

    (i)  N(\varepsilon)/n \xrightarrow{L_p} \mu(\varepsilon), as n \to \infty,      (2.8.17)

    (ii) E(N(\varepsilon)/n)^p \to [\mu(\varepsilon)]^p, as n \to \infty.          (2.8.18)

Proof: The number of ε-correct matches can at most be n, the number
of pairs in the unobserved bivariate data. Hence,

    0 \le N(\varepsilon)/n \le 1,  \forall n = 1,2,...

In other words, {N(ε)/n} is a uniformly bounded sequence of random
variables. It is well known that convergence in probability and L_p
convergence are equivalent for such sequences. Hence, (i) is an easy
consequence of Theorem 2.8.1. It follows from (i) and Theorem 4.5.4
of Chung (1974) that the p-th moment of N(ε)/n converges to
[μ(ε)]^p. Hence (ii) also holds. □

Note that no assumption about the conditional distribution of U
given T was made either in Theorem 2.8.1 or Corollary 2.8.1.
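
The convergence in Theorem 2.8.1 can be illustrated by simulation.
The sketch below rests on two assumptions that are not spelled out in
this part of the text: a match is counted as ε-correct when the
T-value paired with U_i by φ* lies within ε of the true partner T_i,
and the parent is bivariate normal with unit variances and correlation
ρ, for which μ(ε) = P(|T - U| ≤ ε) with T - U distributed as
N(0, 2(1 - ρ)). The function names are hypothetical.

```python
import math
import random

def eps_correct_fraction(t, u, eps):
    """N(eps)/n for the rank pairing: U_i is paired with the T-value of equal rank."""
    n = len(t)
    t_sorted = sorted(t)
    order_u = sorted(range(n), key=lambda i: u[i])
    hits = 0
    for rank0, i in enumerate(order_u):   # rank0 is the 0-based rank of U_i
        if abs(t_sorted[rank0] - t[i]) <= eps:
            hits += 1
    return hits / n

def mu_limit(eps, rho):
    """P(|Z| <= eps / sqrt(2(1 - rho))) for a standard normal Z."""
    z = eps / math.sqrt(2.0 * (1.0 - rho))
    return math.erf(z / math.sqrt(2.0))

def simulate_fraction(n, rho, eps, seed=3):
    """One realization of N(eps)/n from an assumed unit-variance normal parent."""
    rng = random.Random(seed)
    t = [rng.gauss(0.0, 1.0) for _ in range(n)]
    u = [rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0) for x in t]
    return eps_correct_fraction(t, u, eps)

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, simulate_fraction(n, 0.5, 0.5), mu_limit(0.5, 0.5))
```

A single realization is used for each n on purpose: the theorem
asserts that N(ε)/n itself, and not only its mean, settles near μ(ε).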

Yahav generated samples from a bivariate normal parent with mean
vector (0, 0)' and covariance matrix

    \begin{pmatrix}
      \rho^2/(1-\rho^2) & \rho^2/(1-\rho^2) \\
      \rho^2/(1-\rho^2) & 1/(1-\rho^2)
    \end{pmatrix}.                                                    (2.8.19)

Note that in (2.8.19) the variances of T and U are functions of the
correlation of T and U because Yahav requires that the conditional
distribution of U given T = t be normal with mean t and variance 1.
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The  limiting  value  of  y  (c)  for  his  particular  model  was  given  by 

n 

the  integral 


y(c)  =  \  {*(x 


;)  -  $(  -  “)}  d*(x) 

✓  1  +p 


(2.8.20) 


✓  1  +p 

He computed μ(ε) by numerical integration for ε = 0.01, 0.05, 0.1,
0.3. He also provided Monte Carlo estimates of μ_n(ε), for n = 10,
20 and 50 using the simulated data on T and U. The following table
is a typical example from his tables.

Table 2.1  Expected Average Number of ε-Correct Matchings, ε = .01
(Yahav (1982))

   ρ      μ_10(ε)    μ_20(ε)    μ_50(ε)    μ(ε)
  .01     .5864      .5326      .52752     .52269
  .05     .1984      .1648      .12712     .11522
  .10     .1512      .1058      .07600     .05912
  .30     .1084      .0686      .03888     .02144
  .50     .1020      .0582      .02720     .01382
  .70     .0960      .0614      .02616     .01051
  .90     .0972      .0540      .02064     .00864
  .95     .0976      .0496      .02144     .00829
  .99     .0960      .0484      .02128     .00804

It is clear from Table 2.1 that μ_n(ε) and μ(ε) are decreasing as ρ
ranges from 0.01 to 0.99. However, one expects that an optimal
strategy such as φ* has the property that μ_n(ε) as well as μ(ε) are
monotone increasing in ρ. The problem here is not with the M.L.P,
φ*, but with Yahav's model in (2.8.19) because, as the correlation
changes its value, so do the marginal variances of T and U. To
rectify this problem, we assumed a bivariate normal model for T and U
in which the means were zero and the covariance matrix was

    \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.                (2.8.21)


For each combination of four values of n, namely 10, 20, 50 and 100,
and twelve values of ρ, namely 0.00, 0.10 (0.10) 0.90, 0.95, 0.99,
a sample of size 1000 was generated from the bivariate normal
population using the IMSL subroutines. These data were used to obtain
Monte Carlo estimates of μ_n(ε), where ε was given the values 0.01,
0.05, 0.1, 0.3, 0.5, 0.75, 1.0. Furthermore, it is easy to show
that, for the model in (2.8.21),

    \mu(\varepsilon) = P(|Z| \le \varepsilon / \sqrt{2(1-\rho)}),                 (2.8.22)

where Z is a standard normal random variable. It is clear from
(2.8.22) that μ(ε) is a monotone increasing function of ρ. Using
standard normal CDF tables, μ(ε) in (2.8.22) was computed for each
combination of the twelve values of ρ and the seven values of ε
mentioned above. We have presented the estimated values of μ_n(ε)
and the limiting value μ(ε) in Table 2.2 to Table 2.8.
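
The limiting values in (2.8.22) are easy to recompute. The sketch
below evaluates μ(ε) with the error function instead of printed
tables. The second function, which rounds the argument to two decimals
before evaluating the normal CDF, is only a guess at how table lookup
might have been carried out; it is an assumption made for
illustration, and the function names are hypothetical.

```python
import math

RHOS = [0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
EPSILONS = [0.01, 0.05, 0.1, 0.3, 0.5, 0.75, 1.0]

def mu_exact(eps, rho):
    """mu(eps) = P(|Z| <= eps / sqrt(2(1 - rho))), evaluated with math.erf."""
    z = eps / math.sqrt(2.0 * (1.0 - rho))
    return math.erf(z / math.sqrt(2.0))

def mu_table_lookup(eps, rho):
    """Same quantity with z rounded to two decimals, as a CDF table would require
    (assumed procedure, not documented in the text)."""
    z = round(eps / math.sqrt(2.0 * (1.0 - rho)), 2)
    return math.erf(z / math.sqrt(2.0))

if __name__ == "__main__":
    for eps in EPSILONS:
        row = ["%.3f" % mu_exact(eps, rho) for rho in RHOS]
        print("eps = %-5s" % eps, " ".join(row))
```

Each printed row is nondecreasing from left to right, which is the
monotonicity in ρ noted after (2.8.22).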


Table 2.2  Expected Average Number of ε-Correct Matchings, ε = 0.01

   ρ      μ_10(ε)   μ_20(ε)   μ_50(ε)   μ_100(ε)   μ(ε)
  0.00    0.106     0.054     0.025     0.015      0.008
  0.10    0.113     0.059     0.028     0.017      0.008
  0.20    0.127     0.068     0.031     0.018      0.008
  0.30    0.138     0.075     0.034     0.020      0.008
  0.40    0.166     0.083     0.038     0.023      0.008
  0.50    0.174     0.095     0.044     0.026      0.008
  0.60    0.199     0.109     0.051     0.030      0.008
  0.70    0.231     0.129     0.061     0.036      0.008
  0.80    0.279     0.162     0.077     0.046      0.016
  0.90    0.374     0.222     0.109     0.067      0.016
  0.95    0.476     0.296     0.151     0.094      0.024
  0.99    0.700     0.521     0.299     0.191      0.056

Table 2.3  Expected Average Number of ε-Correct Matchings, ε = 0.05

   ρ      μ_10(ε)   μ_20(ε)   μ_50(ε)   μ_100(ε)   μ(ε)
  0.00    0.127     0.076     0.047     0.037      0.032
  0.10    0.134     0.082     0.051     0.040      0.032
  0.20    0.149     0.093     0.056     0.043      0.032
  0.30    0.161     0.099     0.061     0.047      0.032
  0.40    0.180     0.109     0.066     0.052      0.040
  0.50    0.201     0.124     0.074     0.057      0.040
  0.60    0.228     0.141     0.085     0.065      0.048
  0.70    0.262     0.166     0.101     0.076      0.048
  0.80    0.317     0.205     0.124     0.094      0.064
  0.90    0.420     0.280     0.174     0.135      0.088
  0.95    0.529     0.368     0.237     0.186      0.127
  0.99    0.769     0.631     0.459     0.377      0.274


Table 2.4  Expected Average Number of ε-Correct Matchings, ε = 0.1

   ρ      μ_10(ε)   μ_20(ε)   μ_50(ε)   μ_100(ε)   μ(ε)

00 

o .  ion 

0 . 10? 

0.076 

0 . 066 

C  .  066 

10 

0.  160 

0.110 

0 . 080 

0 . 069 

0 . 066 

/•/ 

?0 

0.1/7 

0. 1?1 

0.087 

0 . 074 

0 . 064 

30 

0  .  L  89 

0  .  1  to 

0 . 093 

0 . 080 

0 . 064 

no 

0  .  ?  1 0 

0.143 

0. 101 

0 . 088 

0 . 072 

60 

0 . 1*  3  A 

0.161 

0.11? 

0 . 096 

0 . 080 

60 

o .  ?64 

o  .  l  a  i 

0 . 1?7 

0  .  108 

0 . 088 

"V 

70 

0 . 30? 

0  .  ?  1 0 

0  .  1  49 

0 . 1  ?6 

0  .  103 

a! 

ao 

0 . 36  3 

0  .  ?  6  8 

0 . 18? 

0  .  1  64 

0 . 1  ?7 

*>: 

90 

0.477 

0 . 347 

0  .  ?64 

0  .  ?18 

0  .  1  74 

9b 

0 . 69 n 

0 . 46? 

0. 34? 

0 . 299 

0  .  ?6  l 

«> 

A 

99 

0 .8  39 

0 . 744 

0 , 0  80 

0 . 680 

0.6?? 

/ 

Table 2.5  Expected Average Number of ε-Correct Matchings, ε = 0.3

   ρ      μ_10(ε)   μ_20(ε)   μ_50(ε)   μ_100(ε)   μ(ε)

00 

0  .  ?66 

0 . 708 

0 . 1  84 

0  .  176 

0.166 

V" 

1  0 

0  .  ?  6  6 

0 . 7?  3 

0 . 196 

0.186 

0  .  174 

■O' 

7  0 

0 . 784 

0.7)  7 

0 . 707 

0.19  7 

0.190 

to 

0  .  1 06 

0 .7  68 

0.771 

0  711 

0.19  7 

4  0 

0  .  t  3 4 

0 . 7  76 

0  .  ?40 

0 . 7  79 

0.713 

ft 

60 

0  .16  3 

0 . 304 

0 . 763 

0  .  ?60 

0 .736 

60 

0 . 401 

0 . 336 

0  .  ?93 

0 . 7  78 

0 . 766 

7", 

70 

0 . 4  66 

0 . 38? 

0 . 3  37 

0  .  170 

0 . 303 

80 

0.63? 

0 . 467 

0 . 403 

0 . 386 

0 . 36? 

90 

0.670 

0 . 693 

o 

o 

0.619 

0 . 497 

96 

0 . 802 

0.733 

0.689 

0.674 

0.658 

•V 

99 

0.978 

0.968 

0.961 

0.961 

0.966 

to*, 

- 


Table 2.6  Expected Average Number of ε-Correct Matchings, ε = 0.5

   ρ      μ_10(ε)   μ_20(ε)   μ_50(ε)   μ_100(ε)   μ(ε)
  0.00    0.353     0.311     0.290     0.281      0.274
  0.10    0.367     0.330     0.306     0.298      0.289
  0.20    0.390     0.348     0.325     0.315      0.311
  0.30    0.417     0.371     0.344     0.336      0.326
  0.40    0.452     0.400     0.373     0.362      0.354
  0.50    0.485     0.437     0.404     0.393      0.383
  0.60    0.528     0.478     0.446     0.435      0.425
  0.70    0.591     0.536     0.506     0.495      0.484
  0.80    0.675     0.628     0.594     0.584      0.570
  0.90    0.811     0.773     0.752     0.744      0.737
  0.95    0.917     0.896     0.888     0.885      0.886
  0.99    0.998     0.999     0.999     0.999      1.000

Table 2.7  Expected Average Number of ε-Correct Matchings, ε = 0.75

   ρ      μ_10(ε)   μ_20(ε)   μ_50(ε)   μ_100(ε)   μ(ε)
  0.00    0.468     0.433     0.416     0.409      0.404
  0.10    0.488     0.454     0.437     0.429      0.425
  0.20    0.514     0.477     0.461     0.453      0.445
  0.30    0.539     0.505     0.487     0.480      0.471
  0.40    0.582     0.542     0.522     0.514      0.503
  0.50    0.621     0.586     0.560     0.555      0.547
  0.60    0.662     0.633     0.613     0.606      0.599
  0.70    0.727     0.694     0.679     0.673      0.668
  0.80    0.810     0.786     0.772     0.768      0.766
  0.90    0.919     0.908     0.906     0.904      0.907
  0.95    0.979     0.976     0.978     0.979      0.982
  0.99    1.000     1.000     1.000     1.000      1.000

Table 2.8  Expected Average Number of ε-Correct Matchings, ε = 1.0

   ρ      μ_10(ε)   μ_20(ε)   μ_50(ε)   μ_100(ε)   μ(ε)

0 . 00 

0 . 6/0 

0 . 6  A  6 

0.631 

0 . 62A 

0 . 6?2 

0  1 0 

0.603 

0 . 666 

0.666 

0 . 6 AO 

0 . 6a7 

0  .  ?0 

0 . 6/1 

0 . 606 

0.681 

0.676 

0.670 

0  .  10 

0 . 6 A  6 

0 . 6/2 

0.611 

0 . 606 

0 . 606 

0  .  A 0 

0 . 600 

0 . 66  A 

0.660 

0 . 6A A 

0 . 6? 7 

0  .  60 

0 . 7/0 

0 . 707 

0.601 

0.688 

0 . 683 

0 . 60 

0 . 77? 

0 . 76  3 

0 . 7A A 

0 . 7  A 1 

0 . 737 

0 . 70 

0.830 

0.81? 

0 . 807 

0.806 

0 . 803 

0  .  HO 

0 . 808 

0 . 880 

0 . 887 

0.886 

0 . 886 

0 . 00 

0 .0  70 

O  .  070 

0.07? 

0.07? 

0.076 

0  .  06 

0 . 006 

0 . 006 

0.007 

0.007 

0 . 008 

0  00 

1  .  000 

1  .  000 

1 . 000 

1 . 000 

1  .  000 

Note that, as expected, μ_n(ε) is a monotone increasing function
of ρ, for each fixed ε. Furthermore, the quality of the merged file is
quite good if we want to recreate contingency tables with
intervals of size .6σ or more and the correlation ρ is ≥ 0.6.
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2 . 9  Poisson  Convergence  of  N(<p“) 


Let us revisit, for a moment, the card-matching problem which
was discussed in Section 2.3. Some of the distributional properties
of the number of correct matches in randomly arranging one pack of
cards against another were stated in Proposition 2.3.1. In
particular, the well known approximation of the distribution of the
number of correct matches by a Poisson distribution with mean 1 was
mentioned. This Poisson approximation may be motivated by the
observation that the occurrence of a match tends to be a rare event
when the number of cards in the matching problem grows indefinitely.
Inspired by this result, it is natural to ask whether Poisson
distributions can approximate the distribution of the number of
correct matches due to data-based matching strategies. The answer is
in the affirmative in the case of the maximum likelihood pairing φ*.
Our aim in this section is to establish the Poisson convergence of
N(φ*).
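
The classical card-matching approximation recalled above is easy to
verify for a small pack. The sketch below is an illustration only,
with hypothetical function names: it enumerates all arrangements of n
cards, tabulates the exact distribution of the number of matches, and
prints it next to the Poisson probabilities with mean 1.

```python
import itertools
import math

def match_count_distribution(n):
    """Exact distribution of the number of matches (fixed points) for n cards."""
    counts = [0] * (n + 1)
    for p in itertools.permutations(range(n)):
        counts[sum(p[i] == i for i in range(n))] += 1
    total = math.factorial(n)
    return [c / total for c in counts]

def poisson_pmf(k, mean=1.0):
    """Poisson probability of the value k."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

if __name__ == "__main__":
    dist = match_count_distribution(7)
    for k, prob in enumerate(dist):
        print(k, round(prob, 5), round(poisson_pmf(k), 5))
```

Even for seven cards the two columns agree closely for small numbers
of matches. Exactly n - 1 matches is impossible, so that entry of the
exact distribution is zero.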


Using the general representation in Corollary 2.6.1 for the
number of correct matches, we can write

    N \equiv N(\varphi^*) = \sum_{i=1}^{n} I_{A_{ni}},                  (2.9.1)

where A_ni = (R_1i = R_2i), i = 1,2,...,n are exchangeable events. It
follows that E(N) = nP(A_n1). Zolutikhina and Latishev (1978)
sketched a proof of the fact that the expectation of N converges to a
constant as n tends to ∞. Their approach starts with writing P(A_n1)
as the triple integral

    P(A_{n1}) = \frac{1}{\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{0}^{\pi}
                \exp[(n-1) \ln(s(x,y,\theta))] \, d\theta \, dH(x,y),

where

    s(x,y,\theta) = p_3(x,y) + 2 \sqrt{p_1(x,y) p_2(x,y)} \cos\theta,
    p_1(x,y) = F(x) - H(x,y),
    p_2(x,y) = G(y) - H(x,y),

and p_3(x,y) = 1 - p_1(x,y) - p_2(x,y), ∀ x,y ∈ R, 0 ≤ θ ≤ π.

Using the well known method of Laplace (Bleistein and Handelsman,
1975), they expanded this integral in powers of 1/n and concluded
that P(A_n1) ~ a/n for large n, where the constant a is given by

    a = \int [ h(x, G^{-1}F(x)) / h_2(G^{-1}F(x)) ] \, dx.              (2.9.2)

They concluded that, in large samples, E(N) ~ a.


In this section, we shall generalize the result of Zolutikhina
and Latishev (1978) by showing that the d-th factorial moment of
N, E(N^{(d)}), converges to a^d, d ≥ 1, under certain conditions on
the distribution of (T, U). As a consequence, we shall obtain the
weak convergence of N to the Poisson distribution with mean a.

We begin with the observation that the ranks
R_1 = (R_11, ..., R_1n) and R_2 = (R_21, ..., R_2n) are invariant
under increasing functions of T and U respectively. For this reason,
N is also invariant under such transformations. Without loss of
generality, we therefore replace T and U by F(T) and G(U)
respectively, where F (G) is the marginal distribution function of
T (U). This so-called probability integral transformation allows us
to assume that T and U are marginally uniform random variables and
that the parent CDF, H(t,u), is the joint CDF of F(T) and G(U).
Furthermore, the integral (2.9.2) simplifies to

    a = \int_{0}^{1} h(x,x) \, dx.

We might recall from Section 2.2 that this simpler version of a was
called λ. We shall henceforth use these simplifications and seek to
prove that N weakly converges to the Poisson distribution with mean
λ.

Following  Schwelzer  and  Wolff  (1981),  the  joint  CDF  of  F(T)  and 
G(U)  will  be  called  a  copula .  In  general,  a  copula  is  denoted  by  the 




symbol $C(\cdot,\cdot)$ and the following Fréchet bounds apply to any copula:

$$ \max(x+y-1, 0) \le C(x,y) \le \min(x,y), \quad \forall\, (x,y) \in [0,1]^2. \tag{2.9.3} $$

However, for the purpose of deriving the distribution of N, we shall consider only a part of the spectrum (2.9.3) of all possible copulas.


To motivate our choice of the copulas, first note that, in this chapter, only absolutely continuous joint densities are allowed for T and U. This means that the extremes $\max(x+y-1, 0)$ and $\min(x,y)$ are ruled out because these copulas correspond to degenerate joint distributions for T and U (Mardia 1970, p. 32). Second, Goel (1975) has observed that $\varphi^* = (1, 2, \ldots, n)$ is the M.L.P. iff the joint density of T and U has the M.L.R. property. However, the M.L.R. property necessarily implies that the distribution function of $(T, U)'$ must be such that $C(x,y) \ge xy$, for all (x,y) in the unit square (Tong (1980), p. 80). We shall henceforth assume that the joint CDF of T and U will satisfy the inequalities

$$ xy \le C(x,y) \le \min(x,y), \quad \forall\, (x,y) \in [0,1]^2. \tag{2.9.4} $$

Note that, in (2.9.4), T and U are independent iff $C(x,y) = xy$. Positive dependence of T and U occurs when $C(x,y) \ge xy$, for all x and y. In the remainder of this section, the joint CDF of T and U will be a copula C in the class (2.9.4) and the corresponding joint density function will be denoted by c(x,y).
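As an illustration of the class (2.9.4), the following sketch checks on a grid whether a member of the Morgenstern family lies between the independence copula and the upper Fréchet bound. It assumes the standard closed form $C(x,y) = xy[1 + \alpha(1-x)(1-y)]$ for the Morgenstern CDF on the unit square; the function names and the grid check are hypothetical helpers, not part of the original text.

```python
def morgenstern_cdf(x, y, alpha):
    """Morgenstern copula CDF on the unit square (standard form, assumed here)."""
    return x * y * (1.0 + alpha * (1.0 - x) * (1.0 - y))

def in_class_294(alpha, grid=50, tol=1e-12):
    """Grid check of xy <= C(x,y) <= min(x,y), i.e. the inequalities (2.9.4)."""
    for i in range(grid + 1):
        for j in range(grid + 1):
            x, y = i / grid, j / grid
            c = morgenstern_cdf(x, y, alpha)
            if not (x * y - tol <= c <= min(x, y) + tol):
                return False
    return True
```

Under this assumption, nonnegative values of the parameter pass the check while negative values violate the lower bound xy, which is consistent with restricting attention to positively dependent copulas.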

Since $R_1$ and $R_2$ are some permutations of $(1, 2, \ldots, n)$, we find it convenient to use the notation $\varphi$ for realizations of $R_1$ or $R_2$. The common support of $R_1$ and $R_2$ is denoted by $\Phi$, the set of n! permutations of $1, 2, \ldots, n$.

We will now formally establish an equivalence between the card matching problem and the M.L.P. in the independence case.

Proposition 2.9.1: Let T and U be independent random variables. Then the distribution of $V = (V_{n1}, \ldots, V_{nn})$ defined in Section 2.2 is the same as that of the vector $\delta = (\delta_{n1}, \ldots, \delta_{nn})$, where

$$ \delta_{ni} = I_{(R_{1i} = i)}, \quad i = 1, 2, \ldots, n. \tag{2.9.5} $$

Furthermore, the random variables $\delta_{n1}, \ldots, \delta_{nn}$ are exchangeable.

Proof: Note that the rank vectors

$$ R_1 = (R_{11}, \ldots, R_{1n}) \quad \text{and} \quad R_2 = (R_{21}, \ldots, R_{2n}) $$

are independent because T and U are, by hypothesis, independent random variables, and that $R_1$ and $R_2$ are discrete uniform on $\Phi$. That is,

$$ P(R_a = \varphi) = \frac{1}{n!}, \quad \forall\, \varphi \in \Phi \text{ and } a = 1, 2. \tag{2.9.6} $$


As the $V_{ni}$'s are indicators of the occurrence of matches, the Bernoulli variables $\delta_{n1}, \ldots, \delta_{nn}$ in (2.9.5) can be looked upon as indicating whether $R_{1i}$ matches with i or not, $i = 1, 2, \ldots, n$. It is clear that the common support of V and $\delta$ is

$$ A = \{(a_1, \ldots, a_n) : a_i = 0 \text{ or } 1,\ i = 1, 2, \ldots, n,\ \sum_{i=1}^{n} a_i \ne n-1\}. \tag{2.9.7} $$

Note that A has $2^n - n$ sample points, since exactly $n-1$ matches cannot occur.
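The count of sample points in A can be confirmed by brute force for small n. The following is a minimal sketch with a hypothetical helper name; positions are indexed from 0 rather than 1.

```python
from itertools import permutations

def match_patterns(n):
    """All 0/1 vectors (I(phi(i) = i))_i realised by some permutation phi of n items."""
    return {
        tuple(int(value == position) for position, value in enumerate(perm))
        for perm in permutations(range(n))
    }
```

For each small n, the number of distinct patterns should equal $2^n - n$, and no pattern should contain exactly $n-1$ ones.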


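Let $a = (a_1, \ldots, a_n)$ be a fixed but otherwise arbitrary point in A. Define the events

$$ D(a, \varphi) = \{\psi \in \Phi : I_{(\psi(i) = \varphi(i))} = a_i,\ i = 1, 2, \ldots, n\}, \tag{2.9.8} $$

where $\varphi \in \Phi$. Then, using the independence of $R_1$ and $R_2$ and (2.9.8), we get

$$ P(V = a) = P(I_{(R_{1i} = R_{2i})} = a_i,\ i = 1, 2, \ldots, n) $$

$$ = E\, P(I_{(R_{1i} = \varphi(i))} = a_i,\ i = 1, 2, \ldots, n \mid R_2 = \varphi) $$

$$ = E\, P(I_{(R_{1i} = \varphi(i))} = a_i,\ i = 1, 2, \ldots, n) $$

$$ = E\, P(R_1 \in D(a, \varphi)), \tag{2.9.9} $$

where the expectation is taken over the distribution of $R_2$.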


We now observe that the components of a dictate which positions of $\varphi = (\varphi(1), \ldots, \varphi(n))$ must be matched or mismatched by any permutation $\psi$ in order that $\psi \in D(a, \varphi)$. Clearly, the number of ways in which we can permute the integers $1, 2, \ldots, n$ and produce $\psi$'s that belong to $D(a, \varphi)$ depends only on the fixed vector a and the fact that $\varphi$ is an arrangement of n distinct integers. Hence the cardinality of $D(a, \varphi)$ does not change as $\varphi$ ranges over $\Phi$. In particular, $D(a, \varphi)$ and $D(a, \varphi^*)$ have the same number of sample points, where $\varphi^* = (1, 2, \ldots, n)$. Using (2.9.6), we therefore obtain

$$ P(R_1 \in D(a, \varphi)) = P(R_1 \in D(a, \varphi^*)), \quad \forall\, \varphi \in \Phi. \tag{2.9.10} $$


The right-hand side expression in (2.9.10) is a fixed number depending on $\varphi^*$ and the chosen a. This means that in (2.9.9), we seek the expectation of a degenerate random variable. Hence, we obtain

$$ P(V = a) = P(R_1 \in D(a, \varphi^*)) $$

$$ = P(I_{(R_{1i} = i)} = a_i,\ i = 1, 2, \ldots, n) $$

$$ = P(\delta = a). \tag{2.9.11} $$

Because a was arbitrarily chosen from A, we finally infer from (2.9.11) that

$$ (V_{n1}, \ldots, V_{nn}) \overset{d}{=} (\delta_{n1}, \ldots, \delta_{nn}). \tag{2.9.12} $$

The exchangeability of $\delta_{n1}, \ldots, \delta_{nn}$ follows from the fact that the distribution of $R_1$ is uniform over $\Phi$.

It readily follows from Proposition 2.9.1 that, in the independence case,

$$ \sum_{i=1}^{n} V_{ni} \overset{d}{=} \sum_{i=1}^{n} \delta_{ni}. \tag{2.9.13} $$

In view of (2.9.13), if we let $Z_n = \sum_{i=1}^{n} \delta_{ni}$, then the exact as well as asymptotic distributions of $N(\varphi^*) = \sum_{i=1}^{n} V_{ni}$ can be derived by studying $Z_n$, which is the same as the number of matches in the card matching problem. As stated in Proposition 2.3.1, the asymptotic distribution of $Z_n$ is Poisson with mean 1. We now present another proof of this well known result. The novel part of our proof is that we establish certain dependence properties of $\delta_{n1}, \ldots, \delta_{nn}$ and consequently derive the limiting distribution by using only the first two moments of $Z_n$.
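For intuition about the card matching limit, the exact distribution of the number of matches can be compared with the Poisson distribution with mean 1. The sketch below is an illustration with hypothetical function names; it uses the classical derangement recursion rather than anything specific to this section.

```python
import math

def match_pmf(n):
    """Exact P(Z_n = k), k = 0..n: number of fixed points of a uniform permutation of n cards."""
    # Derangement numbers D_0..D_n via D_m = (m - 1) * (D_{m-1} + D_{m-2})
    d = [1, 0]
    for m in range(2, n + 1):
        d.append((m - 1) * (d[m - 1] + d[m - 2]))
    return [math.comb(n, k) * d[n - k] / math.factorial(n) for k in range(n + 1)]

def poisson_pmf(k, lam=1.0):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def total_variation(n):
    """Total variation distance between the law of Z_n and Poisson(1)."""
    pmf = match_pmf(n)
    poisson_tail = 1.0 - sum(poisson_pmf(k) for k in range(n + 1))
    body = sum(abs(pmf[k] - poisson_pmf(k)) for k in range(n + 1))
    return 0.5 * (body + poisson_tail)
```

The distance shrinks quickly as n grows, in line with the Poisson limit that the argument in this section derives by other means.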

Our program can be stated as below:

(i) Show that the $\delta_{ni}$'s have a certain positive dependence structure.

(ii) Invoke a theorem due to Newman (1982) to arrive at the Poisson convergence of N in the independence case.

We start with the definitions of some concepts of dependence of random variables.

Definition 2.9.1 (Lehmann, 1966): $X_1$ and $X_2$ are said to be positive quadrant dependent (PQD) iff

$$ P(X_1 > x_1, X_2 > x_2) \ge P(X_1 > x_1)\, P(X_2 > x_2), \quad \forall\, x_1, x_2 \in R. \tag{2.9.14} $$

Definition 2.9.2 (Newman, 1982): $X_1, \ldots, X_n$ are said to be linearly positive quadrant dependent (LPQD) iff for any disjoint subsets A, B of $\{1, 2, \ldots, n\}$ and positive constants $a_1, \ldots, a_n$,

$$ \sum_{k \in A} a_k X_k \text{ and } \sum_{k \in B} a_k X_k \text{ are PQD.} \tag{2.9.15} $$

Definition 2.9.3 (Esary, Proschan, Walkup, 1967): $X_1, \ldots, X_n$ are said to be associated iff for every choice of functions $f_1(x_1, \ldots, x_n)$ and $f_2(x_1, \ldots, x_n)$, which are monotonic increasing in each argument,

$$ \mathrm{cov}(f_1(X_1, \ldots, X_n), f_2(X_1, \ldots, X_n)) \ge 0, \tag{2.9.16} $$

provided $f_1(X_1, \ldots, X_n)$ and $f_2(X_1, \ldots, X_n)$ have finite variance.
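For discrete random variables, the PQD condition (2.9.14) can be checked directly from a joint probability table. The sketch below is a hypothetical helper written for illustration. Because both survival functions are step functions, it is enough to test thresholds at the support points.

```python
def is_pqd(pmf, tol=1e-12):
    """pmf: dict mapping (x1, x2) to probability. True iff (2.9.14) holds at every threshold."""
    xs = sorted({x for x, _ in pmf})
    ys = sorted({y for _, y in pmf})
    for a in xs:
        for b in ys:
            joint = sum(p for (x, y), p in pmf.items() if x > a and y > b)
            marg1 = sum(p for (x, _), p in pmf.items() if x > a)
            marg2 = sum(p for (_, y), p in pmf.items() if y > b)
            if joint < marg1 * marg2 - tol:
                return False
    return True
```

For example, a table that puts most of its mass on the diagonal passes the check, while a table concentrated on the anti-diagonal fails it.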

It is well known that association is a stronger property than the LPQD property of n random variables $X_1, \ldots, X_n$. We will now establish that $\delta_{n1}, \ldots, \delta_{nn}$ in (2.9.5) possess a weaker version of the LPQD property.

Lemma 2.9.1: For $k = 1, 2, \ldots, n-1$,

$$ \sum_{i=1}^{k} \delta_{ni} \text{ and } \delta_{nn} \text{ are PQD.} \tag{2.9.17} $$


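Proof: Fix $k = 1, 2, \ldots, n-1$. Then, using (2.9.14), we see that $\sum_{i=1}^{k} \delta_{ni}$ and $\delta_{nn}$ are PQD if

$$ P(\sum_{i=1}^{k} \delta_{ni} > x_1,\ \delta_{nn} > x_2) \ge P(\sum_{i=1}^{k} \delta_{ni} > x_1)\, P(\delta_{nn} > x_2), \quad \forall\, x_1, x_2 \in R. \tag{2.9.18} $$

Because the $\delta_{ni}$'s are binary random variables we obtain

$$ P(\delta_{nn} > x_2) = \begin{cases} 1 & \text{if } x_2 < 0 \\ 0 & \text{if } x_2 \ge 1. \end{cases} \tag{2.9.19} $$

It is clear from (2.9.19) that (2.9.18) holds for any $x_1$, provided $x_2 < 0$ or $x_2 \ge 1$. Hence, it suffices to show (2.9.18) for $0 \le x_2 < 1$. However, if $0 \le x_2 < 1$, then $(\delta_{nn} > x_2) = (\delta_{nn} = 1)$. It therefore remains to be shown that

$$ P(\sum_{i=1}^{k} \delta_{ni} \ge \ell,\ \delta_{nn} = 1) \ge P(\sum_{i=1}^{k} \delta_{ni} \ge \ell)\, P(\delta_{nn} = 1), \quad \forall\, \ell = 0, 1, \ldots, k. \tag{2.9.20} $$

By definition of $\delta_{ni}$,

$$ P(\delta_{ni} = 1) = P(R_{1i} = i) = \frac{1}{n} \quad \text{and} \quad P(\delta_{ni} = 0) = 1 - \frac{1}{n}. \tag{2.9.21} $$

Writing $P(\sum_{i=1}^{k} \delta_{ni} \ge \ell)$ in the form

$$ P(\sum_{i=1}^{k} \delta_{ni} \ge \ell,\ \delta_{nn} = 0) + P(\sum_{i=1}^{k} \delta_{ni} \ge \ell,\ \delta_{nn} = 1) $$

and using (2.9.21) we can rewrite (2.9.20) in a more useful form:

$$ P(\sum_{i=1}^{k} \delta_{ni} \ge \ell \mid \delta_{nn} = 0) \le P(\sum_{i=1}^{k} \delta_{ni} \ge \ell \mid \delta_{nn} = 1), \quad \ell = 0, \ldots, k. \tag{2.9.22} $$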


Note that, in (2.9.22), k is a fixed integer. For a given k, we now fix the value of $\ell$ and proceed to establish the inequality in (2.9.22) by means of a combinatorial argument.

It is clear that we can express the event $(\delta_{nn} = 0)$, or $(R_{1n} \ne n)$, as $\bigcup_{\alpha=1}^{n-1} (R_{1n} = \alpha)$. Hence we can write

$$ (\sum_{i=1}^{k} \delta_{ni} \ge \ell,\ \delta_{nn} = 0) = \bigcup_{\alpha=1}^{n-1} J_\alpha, \tag{2.9.23} $$

where

$$ J_\alpha = (\sum_{i=1}^{k} \delta_{ni} \ge \ell,\ R_{1n} = \alpha), \quad \alpha = 1, 2, \ldots, n-1. \tag{2.9.24} $$

Observe that, in (2.9.24), the $J_\alpha$'s are mutually disjoint measurable subsets of $\Phi$. Let us now fix $\alpha = 1, 2, \ldots, n-1$ as well. Then, any permutation $\varphi$ in $J_\alpha$ satisfies $\varphi(n) = \alpha$ and $(\varphi(1), \ldots, \varphi(n-1))$ is an arrangement of the integers $1, 2, \ldots, \alpha-1, \alpha+1, \ldots, n$ producing at least $\ell$ matches of the type $\varphi(i) = i$ in the positions $i = 1, 2, \ldots, k$. On the other hand, any permutation $\varphi$ in $(\sum_{i=1}^{k} \delta_{ni} \ge \ell,\ \delta_{nn} = 1)$ satisfies $\varphi(n) = n$ and $(\varphi(1), \ldots, \varphi(n-1))$ is an arrangement of the integers $1, 2, \ldots, n-1$ yielding at least $\ell$ matches such as $\varphi(i) = i$ in the positions $i = 1, 2, \ldots, k$. Because $\alpha \ne n$, it is clear that

$$ \#(J_\alpha) \le \#(\sum_{i=1}^{k} \delta_{ni} \ge \ell,\ \delta_{nn} = 1), \tag{2.9.25} $$

where #(A) denotes the cardinality of the set A.

Since $\alpha$, k and $\ell$ were arbitrary choices, we get from (2.9.23),

$$ \#(\sum_{i=1}^{k} \delta_{ni} \ge \ell,\ \delta_{nn} = 0) \le (n-1)\, \#(\sum_{i=1}^{k} \delta_{ni} \ge \ell,\ \delta_{nn} = 1), \quad k = 1, 2, \ldots, n-1;\ \ell = 0, \ldots, k. \tag{2.9.26} $$

Since $R_1$ is discrete uniform on $\Phi$, it follows from (2.9.26) that

$$ P(\sum_{i=1}^{k} \delta_{ni} \ge \ell,\ \delta_{nn} = 0) \le (n-1)\, P(\sum_{i=1}^{k} \delta_{ni} \ge \ell,\ \delta_{nn} = 1). \tag{2.9.27} $$

Multiplying both sides of the inequality in (2.9.27) by n and using (2.9.21) we establish (2.9.22), which implies that (2.9.20) holds. □
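The inequality (2.9.20) can also be confirmed by exhaustive enumeration for small n. The sketch below is an illustrative check, not a substitute for the proof; the function name is hypothetical and positions are indexed from 0.

```python
from fractions import Fraction
from itertools import permutations

def lemma_291_holds(n):
    """Exhaustively check (2.9.20) for all k = 1..n-1 and ell = 0..k, with exact arithmetic."""
    perms = list(permutations(range(n)))
    total = len(perms)
    last_fixed = sum(1 for p in perms if p[n - 1] == n - 1)
    for k in range(1, n):
        for ell in range(k + 1):
            # Permutations with at least ell matches among the first k positions
            partial = [p for p in perms if sum(p[i] == i for i in range(k)) >= ell]
            both = sum(1 for p in partial if p[n - 1] == n - 1)
            lhs = Fraction(both, total)
            rhs = Fraction(len(partial), total) * Fraction(last_fixed, total)
            if lhs < rhs:
                return False
    return True
```

Exact rational arithmetic is used so that cases of equality (for example n = 3, k = 2, ell = 1) are not misjudged by rounding.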
We  now  state  two  useful  results  due  to  Newman. 

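Lemma 2.9.2 Newman (1982): If $X_1$ and $X_2$ are PQD, then

$$ |E(\exp(irX_1 + isX_2)) - E(\exp(irX_1))\, E(\exp(isX_2))| \le |rs|\, \mathrm{cov}(X_1, X_2), \quad \text{for all } r, s \in R. \tag{2.9.28} $$

Lemma 2.9.3 Newman (1982): Suppose that $X_1, \ldots, X_n$ are LPQD. Then

$$ |\Psi_{X_1, \ldots, X_n}(r_1, \ldots, r_n) - \prod_{j=1}^{n} \Psi_{X_j}(r_j)| \le \sum_{k=1}^{n} \sum_{\ell=1,\ k<\ell}^{n} |r_k r_\ell|\, \mathrm{cov}(X_k, X_\ell), \quad \forall\, r_1, \ldots, r_n \in R, \tag{2.9.29} $$

where the $\Psi$'s are given by

$$ \Psi_{X_1, \ldots, X_n}(r_1, \ldots, r_n) = E(\exp(i \sum_{j=1}^{n} r_j X_j)), $$

$$ \Psi_{X_j}(r_j) = E(\exp(i r_j X_j)), \quad j = 1, 2, \ldots, n. $$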


Suppose now that we choose the arguments $r_1, \ldots, r_n$ in (2.9.29) equal to an arbitrary real number r, say. Assume further that $X_1, \ldots, X_n$ are exchangeable random variables so that they have a common characteristic function, namely $\Psi_{X_1}(r)$, and that the covariance between any pair of the $X_i$'s is equal to $\mathrm{cov}(X_1, X_2)$. It follows from (2.9.29) that

$$ |\Psi_{\sum X_i}(r) - \Psi_{X_1}^{n}(r)| \le \frac{n(n-1)}{2}\, |r|^2\, \mathrm{cov}(X_1, X_2). \tag{2.9.30} $$

This estimate for approximating the characteristic function of $\sum_{i=1}^{n} X_i$ by the product of the marginal characteristic functions of the $X_i$'s depends on the fact that $X_1, \ldots, X_n$ are LPQD. We now use Lemma 2.9.2 and show that, with regard to the variables $\delta_{n1}, \ldots, \delta_{nn}$, an estimate similar to (2.9.30) can be obtained under the weaker version of the LPQD property which is given by (2.9.17).

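Lemma 2.9.4: Let the $\delta_{ni}$'s be the Bernoulli variables in (2.9.5) and let $Z_n = \sum_{i=1}^{n} \delta_{ni}$. Then,

$$ |\Psi_{Z_n}(r) - \Psi_{\delta_{n1}}^{n}(r)| \le \frac{n(n-1)}{2}\, |r|^2\, \mathrm{cov}(\delta_{n1}, \delta_{n2}), \quad \forall\, n \ge 2,\ r \in R. \tag{2.9.31} $$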

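Proof: The exchangeability of $\delta_{n1}, \ldots, \delta_{nn}$ was established in Proposition 2.9.1. Hence, we obtain

$$ \mathrm{cov}(\delta_{ni}, \delta_{nj}) = \mathrm{cov}(\delta_{n1}, \delta_{n2}), \quad \forall\, i \ne j, \tag{2.9.32} $$

$$ \Psi_{\delta_{nj}}(r) \equiv \Psi_{\delta_{n1}}(r), \quad \forall\, j. \tag{2.9.33} $$

Note also the well-known property that

$$ |\Psi_{\delta_{nj}}(r)| \le 1, \quad \forall\, j \text{ and } \forall\, r. \tag{2.9.34} $$

From Lemma 2.9.1, we have

$$ \sum_{i=1}^{k} \delta_{ni} \text{ and } \delta_{nn} \text{ are PQD}, \quad \forall\, k = 1, 2, \ldots, n-1. $$

In view of the exchangeability of $\delta_{n1}, \ldots, \delta_{nn}$, we can restate this property of the $\delta_{ni}$'s as follows:

Let A and B be non-empty disjoint subsets of $\{1, 2, \ldots, n\}$ such that B is a singleton. Then

$$ \sum_{i \in A} \delta_{ni} \text{ and } \sum_{i \in B} \delta_{ni} \text{ are PQD.} \tag{2.9.35} $$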


Fix $n \ge 2$ and consider the following finite sequence of statements:

$$ |\Psi_{\sum_{i=1}^{m} \delta_{ni}}(r) - \Psi_{\delta_{n1}}^{m}(r)| \le \frac{m(m-1)}{2}\, |r|^2\, \mathrm{cov}(\delta_{n1}, \delta_{n2}), \quad \forall\, m = 2, 3, \ldots, n. \tag{2.9.36} $$

Note that (2.9.31) is obtained from (2.9.36) by letting m = n. We shall now establish (2.9.36) by induction on m.

By choosing A = {1}, B = {2} in (2.9.35), we find that $\delta_{n1}$ and $\delta_{n2}$ are PQD. Lemma 2.9.2 readily implies that (2.9.36) holds for m = 2. Now, let us assume that (2.9.36) holds for $m = 2, 3, \ldots, (n-1)$. Splitting $\sum_{i=1}^{n} \delta_{ni}$ as the sum of $\sum_{i=1}^{n-1} \delta_{ni}$ and $\delta_{nn}$, we infer the PQD property of $\sum_{i=1}^{n-1} \delta_{ni}$ and $\delta_{nn}$ from (2.9.35). Hence we obtain again from Lemma 2.9.2 and (2.9.32)

$$ |\Psi_{\sum_{i=1}^{n} \delta_{ni}}(r) - \Psi_{\sum_{i=1}^{n-1} \delta_{ni}}(r)\, \Psi_{\delta_{nn}}(r)| \le |r|^2\, \mathrm{cov}(\sum_{i=1}^{n-1} \delta_{ni},\ \delta_{nn}) $$

$$ = |r|^2\, (n-1)\, \mathrm{cov}(\delta_{n1}, \delta_{n2}). \tag{2.9.37} $$


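Now, we shall invoke the induction hypothesis that (2.9.36) holds for m = n - 1. Using (2.9.33) to (2.9.37) we finally establish (2.9.36) for m = n as follows:

$$ |\Psi_{\sum_{i=1}^{n} \delta_{ni}}(r) - \Psi_{\delta_{n1}}^{n}(r)| $$

$$ \le |\Psi_{\sum_{i=1}^{n} \delta_{ni}}(r) - \Psi_{\sum_{i=1}^{n-1} \delta_{ni}}(r)\, \Psi_{\delta_{nn}}(r)| + |\Psi_{\sum_{i=1}^{n-1} \delta_{ni}}(r)\, \Psi_{\delta_{nn}}(r) - \Psi_{\delta_{n1}}^{n}(r)| $$

$$ \le |r|^2\, (n-1)\, \mathrm{cov}(\delta_{n1}, \delta_{n2}) + |\Psi_{\sum_{i=1}^{n-1} \delta_{ni}}(r) - \Psi_{\delta_{n1}}^{n-1}(r)| $$

$$ \le |r|^2\, (n-1)\, \mathrm{cov}(\delta_{n1}, \delta_{n2}) + |r|^2\, \frac{(n-1)(n-2)}{2}\, \mathrm{cov}(\delta_{n1}, \delta_{n2}) $$

$$ = |r|^2\, \mathrm{cov}(\delta_{n1}, \delta_{n2})\, (n-1) \left[1 + \frac{n-2}{2}\right] $$

$$ = \frac{n(n-1)}{2}\, |r|^2\, \mathrm{cov}(\delta_{n1}, \delta_{n2}). \tag{2.9.38} $$

The proof of (2.9.36) is complete by our inductive argument and (2.9.31) follows from (2.9.38). □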
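The bound (2.9.31) can be checked numerically for small n by enumerating all permutations. The sketch below is an illustration with a hypothetical function name; it uses the values $P(\delta_{n1} = 1) = 1/n$ and $P(\delta_{n1} = 1, \delta_{n2} = 1) = 1/n(n-1)$ for the card matching problem.

```python
import cmath
from itertools import permutations
from math import factorial

def check_2931(n, r):
    """Return (lhs, rhs) of inequality (2.9.31) for the card matching indicators."""
    perms = permutations(range(n))
    # Characteristic function of Z_n, the number of fixed points, at argument r
    psi_z = sum(
        cmath.exp(1j * r * sum(p[i] == i for i in range(n))) for p in perms
    ) / factorial(n)
    # Characteristic function of a single Bernoulli(1/n) indicator
    psi_1 = (1.0 - 1.0 / n) + cmath.exp(1j * r) / n
    cov = 1.0 / (n * (n - 1)) - 1.0 / n ** 2
    lhs = abs(psi_z - psi_1 ** n)
    rhs = n * (n - 1) / 2.0 * r ** 2 * cov
    return lhs, rhs
```

Since the covariance equals $1/n^2(n-1)$, the right-hand side reduces to $r^2/(2n)$, which makes it visible that the bound vanishes as n grows for each fixed r.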


Our preparations so far in this section are adequate for the purpose of establishing the Poisson convergence of N in the independence case.

Theorem 2.9.1: Let T and U be independent random variables. Let the number of correct matches, N, be given by (2.9.1). Then

$$ N \overset{d}{\to} \text{Poisson}(1), \quad \text{as } n \to \infty. \tag{2.9.39} $$

Proof: We obtain from (2.9.13)

$$ N \overset{d}{=} Z_n, $$

where $Z_n = \sum_{i=1}^{n} \delta_{ni}$. Using the exchangeability of the $\delta_{ni}$'s, we obtain

$$ \mathrm{cov}(\delta_{n1}, \delta_{n2}) = P(R_{11} = 1, R_{12} = 2) - [P(R_{11} = 1)]^2. \tag{2.9.40} $$

Since $P(R_{11} = 1, R_{12} = 2) = 1/n(n-1)$, it follows that

$$ n(n-1)\, \mathrm{cov}(\delta_{n1}, \delta_{n2}) = \frac{1}{n}, \quad \forall\, n \ge 2, $$

and therefore

$$ n(n-1)\, \mathrm{cov}(\delta_{n1}, \delta_{n2}) = o(1) \quad \text{as } n \to \infty. \tag{2.9.41} $$

The proof of (2.9.39) consists of showing that the characteristic function of $Z_n$ converges to the characteristic function of the Poisson distribution with mean 1. In other words, we shall show that

$$ \Psi_{Z_n}(r) \to \exp(\exp(ir) - 1), \quad \forall\, r \in R, \text{ as } n \to \infty. \tag{2.9.42} $$


To  this  end,  Lemma  2.9.4  gives  the  following  estimate  of  the 


dependence case are not available at this time. Specifically, no proof of the counterpart of (2.9.17), namely

$$ \sum_{i=1}^{k} V_{ni} \text{ and } V_{nn} \text{ are PQD}, \quad \forall\, k = 1, 2, \ldots, n-1,\ \forall\, n \ge 2, \tag{2.9.46} $$

is known. However, direct verification of the association of $V_{n1}, \ldots, V_{nn}$ has been carried out for n = 2, 3, 4 when T and U have the Morgenstern distribution given by (2.6.16). Since association of random variables is a much stronger dependence structure than (2.9.46), it is natural to conjecture that Lemma 2.9.1 holds even when T and U are dependent.

In the absence of a valid proof of Lemma 2.9.1 in the dependence case, we need extra conditions on the distribution of T and U in order to derive the Poisson convergence of N. The following lemma will be useful in deriving the main result of this section.

Lemma 2.9.5: For a fixed d, let $L_n = S_n / n$ and $L = (L_1, \ldots, L_d)'$, where $S_n$ and L are defined in Section 2.2. Then,

$$ L_n \overset{a.s.}{\to} L, \quad \text{as } n \to \infty. \tag{2.9.47} $$


Proof: Fix $d \ge 1$. It is clear from the definitions of $\xi_k$ in (2.2.10) and the sigma-field $A_d$ in Section 2.2 that the infinite sequence

$$ \xi_{d+1}, \xi_{d+2}, \ldots $$

of d-dimensional vectors are conditionally i.i.d. given $A_d$. Hence, using the Strong Law of Large Numbers for exchangeable sequences (Chow and Teicher, p. 223) we get

$$ \frac{1}{n-d} \sum_{k=d+1}^{n} \xi_k \overset{a.s.}{\to} E(\xi_{d+1} \mid A_d). \tag{2.9.48} $$

In order to evaluate the limiting conditional expectation in (2.9.48), note first that, for $j = 1, 2, \ldots, d$, $T_j$ and $U_j$ are uniform random variables. Now, writing $\xi_{j,d+1}$ for the j-th component of $\xi_{d+1}$,

$$ E[\xi_{j,d+1} \mid T_j = t_j, U_j = u_j] = P(t_j - T_{d+1} > 0) - P(u_j - U_{d+1} > 0) $$

$$ = t_j - u_j. \tag{2.9.49} $$

Therefore, it follows from the definition of $\xi_{d+1}$ in (2.2.10) and (2.9.49) that

$$ E(\xi_{d+1} \mid A_d) = (L_1, L_2, \ldots, L_d)'. \tag{2.9.50} $$

Hence, (2.9.48) and (2.9.50) imply that

$$ \frac{1}{n} \sum_{k=d+1}^{n} \xi_k \overset{a.s.}{\to} L, \quad \text{as } n \to \infty. \tag{2.9.51} $$

Also, d being a fixed integer, we have

$$ \frac{1}{n} \sum_{k=1}^{d} \xi_k \overset{a.s.}{\to} 0, \quad \text{as } n \to \infty. \tag{2.9.52} $$

Since

$$ L_n = \frac{1}{n} \sum_{k=1}^{n} \xi_k, $$

the lemma follows from (2.9.51) and (2.9.52). □
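The convergence in the lemma can be illustrated by simulation. The sketch below rests on an assumption about (2.2.10), which is not reproduced in this section: that the j-th component of $\xi_k$ is $I(T_k < T_j) - I(U_k < U_j)$, so that the j-th component of $S_n$ is the rank difference $R_{1j} - R_{2j}$. This reading is consistent with (2.9.49) and with the use of $P(S_n = 0)$ for the probability of d simultaneous matches. For simplicity T and U are simulated as independent uniforms; the function name is hypothetical.

```python
import random

def rank_difference_vs_limit(n, d=2, seed=0):
    """Largest |S_n[j]/n - (T_j - U_j)| over j = 1..d, under the assumed form of xi_k."""
    rng = random.Random(seed)
    t = [rng.random() for _ in range(n)]
    u = [rng.random() for _ in range(n)]  # independent of t: a simplifying assumption
    worst = 0.0
    for j in range(d):
        # j-th component of S_n: number of T's below T_j minus number of U's below U_j
        s_j = sum((t[k] < t[j]) - (u[k] < u[j]) for k in range(n))
        worst = max(worst, abs(s_j / n - (t[j] - u[j])))
    return worst
```

For large n the returned discrepancy should be small, since each term is the difference of two empirical distribution functions evaluated at fixed points.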

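The following sufficient conditions will be used to prove the next theorem.

Assumptions: In the notations of Section 2.2, let

$$ \text{(a)} \quad \lambda < \infty, \tag{2.9.53} $$

$$ \text{(b)} \quad \int_{-\infty}^{\infty} |\Psi_L(\theta)|\, d\theta < \infty, \tag{2.9.54} $$

$$ \text{and (c)} \quad P(\Psi_d^* \le t) = o(t^d) \text{ as } t \to 0, \quad \forall\, d \ge 1. \tag{2.9.55} $$

Theorem 2.9.2: If Assumptions (2.9.53) to (2.9.55) hold, then

$$ N \overset{d}{\to} \text{Poisson}(\lambda) \quad \text{as } n \to \infty. \tag{2.9.56} $$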


Proof: The proof of (2.9.56) consists in showing that the factorial moments of N converge to those of the Poisson distribution with mean $\lambda$. In other words,

$$ E(N^{(d)}) \to \lambda^d, \quad \forall\, d = 1, 2, \ldots \tag{2.9.57} $$

By the Fourier inversion theorem,

$$ P(S_n = 0) = (2\pi)^{-d} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} \Psi_{S_n}(\theta)\, d\theta, \tag{2.9.58} $$

where $\Psi_{S_n}(\theta)$ is the characteristic function of the d-dimensional random vector $S_n$ defined in (2.2.7).

The Assumption (2.9.54) ensures that the Fourier inversion theorem can be applied to the continuous random variable L. Noting that $\lambda = \int_0^1 c(x,x)\, dx$ is the value of the density function of L at 0, we get

$$ \lambda = g_L(0) = (2\pi)^{-1} \int_{-\infty}^{\infty} \Psi_L(t)\, dt. $$

Since $L_j = T_j - U_j$, $j = 1, 2, \ldots, d$, are i.i.d. with their common density function equal to $g_L(\cdot)$, it follows that

$$ \lambda^d = (2\pi)^{-d} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \Psi_L(\theta)\, d\theta, \tag{2.9.59} $$

where $\Psi_L(\theta)$ now denotes the characteristic function of the vector $L = (L_1, \ldots, L_d)'$.

Recalling the representation

$$ N(\varphi^*) = \sum_{i=1}^{n} I_{A_{ni}} $$

from Corollary 2.6.1, we obtain

$$ E(N^{(d)}) = n^{(d)}\, P(A_{n1} A_{n2} \cdots A_{nd}) $$

$$ = n^{(d)}\, P(S_n = 0), \tag{2.9.60} $$

where $n^{(d)} = n(n-1) \cdots (n-d+1)$.


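For fixed d, it is clear that $n^{(d)} \approx n^d$ as $n \to \infty$. It therefore follows from (2.9.60) that, in order to prove (2.9.57), it is sufficient to show that

$$ \lim_{n \to \infty} |\Delta(d,n)| = 0, \tag{2.9.61} $$

where $\Delta(d,n) = n^d P(S_n = 0) - \lambda^d$.

From (2.9.58) and (2.9.59), we obtain

$$ \Delta(d,n) = n^d (2\pi)^{-d} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} \Psi_{S_n}(u)\, du - (2\pi)^{-d} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \Psi_L(\theta)\, d\theta. \tag{2.9.62} $$

On making the change of variables $\theta = (nu_1, \ldots, nu_d)$ in the first term of (2.9.62) and noting that $\Psi_{S_n}(\theta/n) = \Psi_{L_n}(\theta)$, we get

$$ \Delta(d,n) = (2\pi)^{-d} \left[ \int_{-n\pi}^{n\pi} \cdots \int_{-n\pi}^{n\pi} \Psi_{L_n}(\theta)\, d\theta - \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \Psi_L(\theta)\, d\theta \right]. \tag{2.9.63} $$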


For positive constants a and B, which will be determined later, define four integrals as follows:

$$ \text{(i)} \quad J_1 = - \int \cdots \int_{|\theta| > a} \Psi_L(\theta)\, d\theta, \tag{2.9.64} $$

$$ \text{(ii)} \quad J_2(n) = \int \cdots \int_{|\theta| \le a} [\Psi_{L_n}(\theta) - \Psi_L(\theta)]\, d\theta, \tag{2.9.65} $$

$$ \text{(iii)} \quad J_3(n) = \int \cdots \int_{a < |\theta| \le Bn} \Psi_{L_n}(\theta)\, d\theta, \tag{2.9.66} $$

$$ \text{(iv)} \quad J_4(n) = \int \cdots \int_{Bn < |\theta| \le \pi n} \Psi_{L_n}(\theta)\, d\theta. \tag{2.9.67} $$

It is easy to verify using these integrals and (2.9.62) that

$$ \Delta(d,n) = (2\pi)^{-d} \sum_{k=1}^{4} J_k. \tag{2.9.68} $$

For appropriate choices of a and B, we will show that

$$ |J_k(n)| \to 0 \text{ as } n \to \infty, \quad k = 1, 2, 3, 4, $$

which will imply (2.9.61).


Let $\epsilon > 0$ be a fixed number. Then, assumption (2.9.54) and the expression (2.9.59) imply that $\Psi_L(\theta)$ is absolutely integrable on $R^d$. Therefore, we can find a large enough a such that

$$ |J_1| \le \int \cdots \int_{|\theta| > a} |\Psi_L(\theta)|\, d\theta < \epsilon/4. \tag{2.9.69} $$

From Lemma 2.9.5, we have

$$ L_n \overset{a.s.}{\to} L, $$

which implies that (cf. Bhattacharya and Ranga Rao, 1976, p. 44)

$$ \Psi_{L_n}(\theta) \to \Psi_L(\theta) \quad \text{as } n \to \infty, $$

the convergence being uniform on the compact subset

$$ \{\theta : \theta \in R^d \text{ and } |\theta| \le a\}. $$

Hence, for the a chosen above, we can find $n_1$ such that $\forall\, n \ge n_1$,

$$ |J_2(n)| < \epsilon/4. \tag{2.9.70} $$

In order to show that $|J_3(n)| \to 0$, we transform $\theta$ to $r = \theta/n$ in $J_3$ and obtain

$$ J_3(n) = n^d \int \cdots \int_{a/n < |r| \le B} \Psi_{S_n}(r)\, dr. \tag{2.9.71} $$

Note that $S_n = \sum_{i=1}^{n} \xi_i$ is a lattice random vector so all its moments exist. Since the $(T_i, U_i)'$ are i.i.d., it follows from the definition of $\xi_i$ in (2.2.10) that

$$ E(S_n) = 0. \tag{2.9.72} $$


It was argued in the proof of Lemma 2.9.5 that, for all n > d, $\xi_{d+1}, \ldots, \xi_n$ are conditionally i.i.d. given $A_d$ with mean

$$ E(\xi_j \mid A_d) = L, \quad \forall\, j = d+1, \ldots, n. $$

It is easy to verify that the dispersion matrices $D(\xi_j \mid A_d)$, $j = d+1, \ldots, n$, are positive definite. Moreover, for $j = 1, 2, \ldots, d$, $\xi_j$ is degenerate given $A_d$ and

$$ D(L) = \sigma^2 I, \tag{2.9.73} $$

where $\sigma^2 = \mathrm{var}(T - U)$ and I is the d×d identity matrix.

The dispersion matrix of $S_n$ is, for n > d,

$$ D(S_n) = D(\sum_{i=1}^{n} \xi_i) $$

$$ = E(D(\sum_{i=1}^{n} \xi_i \mid A_d)) + D(E(\sum_{i=1}^{n} \xi_i \mid A_d)) $$

$$ = (n-d)\, E\, D(\xi_{d+1} \mid A_d) + (n-d)^2\, D(L). $$

We finally conclude that

$$ D(S_n) - (n-d)^2 \sigma^2 I = (n-d)\, E\, D(\xi_{d+1} \mid A_d) \tag{2.9.74} $$

is positive definite.


As the second-order moments of $S_n$ exist, we expand $\Psi_{S_n}(r)$ around r = 0 and, using (2.9.72), obtain

$$ \log \Psi_{S_n}(r) = -\tfrac{1}{2}\, r' D(S_n)\, r + o(\|r\|^2), \quad \text{as } \|r\| \to 0. \tag{2.9.75} $$


In  view  of  (2.9.73),  we  obtain 


|exp(log*s  ( r)  )  |  <  exp(  -  a2l|r||2  «■  0||r(| 2 )  . 

~n 


as  ||r ||  >  0 




Hence,  there  exists  a  constant  6  >  0  such  that  for  n  >  d. 


|*s  (r)  |  <  exp{ -  ^  (n-d)2  a2  ||r||2), 

~n 


V  r  <  B 


Now,  3  n„  such  that  V  >  n„ ,  ~  <  B  so  that  we  obtain  using 
2  —  2  n 

(2.9.72)  and  (2.9.76) 


I J3  ( n )  1  <  nd  J  ...  J  exp(  -  ^  (n-d)2  a 2  ||r||2)  dr 


(2.9.76) 


-  < I r | <B 
n  x 


<  I  •  •  •  \  exp(-  \  a2  ||r||2)  dr 
|e|  >  a 


(2.9.77) 


It is clear that we can choose a large enough a in (2.9.77) such that $\forall\, n \ge n_2$,

$$ |J_3(n)| < \epsilon/4. \tag{2.9.78} $$

Finally, to show that $|J_4| \to 0$, we transform $u = \theta/n$ in (2.9.67) and obtain

$$ |J_4(n)| \le n^d \int \cdots \int_{B < |u| \le \pi} |\Psi_{S_n}(u)|\, du. \tag{2.9.79} $$


In view of the earlier remarks about the conditional distributions of $\xi_1, \ldots, \xi_n$ given $A_d$, we obtain for n > d,

$$ |\Psi_{S_n}(u)| \le E_{A_d}\, |\Psi_{\xi_{d+1}}(u)|^{n-d}, \tag{2.9.80} $$

where $\xi_{d+1} = \xi_{d+1}(w_1, \ldots, w_d)$ is the value of $\xi_{d+1}$ given $w_i = (T_i, U_i)$, $i = 1, 2, \ldots, d$. Since the characteristic function $\Psi_{\xi_{d+1}}(u)$ is uniformly continuous on the compact set $\{u : B \le |u| \le \pi\}$ of $R^d$, it attains its maximum inside this set, say at $u = u^*$. Furthermore, $\Psi_{\xi_{d+1}}$ has period $2\pi$ so that, for almost all realizations $(w_1, \ldots, w_d)$,

$$ \sup_{B \le |u| \le \pi} |\Psi_{\xi_{d+1}}(u)| < 1. \tag{2.9.81} $$

Letting $\Psi_d^* = -\ln |\Psi_{\xi_{d+1}}(u^*)|$, we get from (2.9.79) and (2.9.80),

$$ |J_4| \le n^d\, E_{A_d}(\exp(-(n-d)\, \Psi_d^*)) $$

$$ = n^d\, M_{\Psi_d^*}(n-d), \tag{2.9.82} $$

where

$$ M_{\Psi_d^*}(s) = \int \cdots \int \exp(-s \Psi_d^*) \prod_{j=1}^{d} dC(x_j, y_j) \tag{2.9.83} $$

is the moment generating function of $\Psi_d^*$ with a real positive argument.


Now, using the Abelian Theorem (cf. Widder (1941), p. 181), we obtain

$$ \limsup_{t \to \infty}\, t^d M_{\Psi_d^*}(t) \le \limsup_{t \downarrow 0} \left[ \frac{P(\Psi_d^* \le t)}{t^d}\, \Gamma(d+1) \right]. \tag{2.9.84} $$

By Assumption (2.9.55), the right-hand side of (2.9.84) is zero and it follows that


to 


1 


tion with marginal distributions F(t) and G(u) then complete sets of orthonormal functions $\{\eta_{1i}\}$, $\{\eta_{2i}\}$, $i = 1, 2, \ldots$, can be defined on the marginal distributions such that

$$ dH(t,u) = [1 + \sum_{i=1}^{\infty} \rho_i\, \eta_{1i}(t)\, \eta_{2i}(u)]\, dF(t)\, dG(u) \tag{2.9.87} $$

$$ \text{and} \quad \phi^2 = \sum_{i=1}^{\infty} \rho_i^2. \tag{2.9.88} $$

It may be recalled from (2.6.12) that, when all $\rho_i \ge 0$ in the above canonical expansion of the joint distribution of T and U, we say T and U are positive dependent by expansion (PDE). It follows from (2.9.87) that, when a copula C(t,u) is $\phi^2$-bounded, $\lambda$ in (2.9.53) can be evaluated using the orthonormality of $\{\eta_i\}$ as

$$ \lambda = \int_0^1 c(x,x)\, dx = 1 + \sum_{i=1}^{\infty} \rho_i. \tag{2.9.89} $$

It follows from (2.9.88) and (2.9.89) that the finiteness of $\phi^2$ and $\lambda$ are related to each other. Specifically, since $\forall\, i \ge 1$ the canonical correlations $\rho_i \le 1$, we obtain, in the PDE case where $0 \le \rho_i^2 \le \rho_i$,

$$ \lambda < \infty \Rightarrow \phi^2 < \infty. $$

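With regard to the Morgenstern distribution in (2.6.16), we obtain

$$ \rho_i = \begin{cases} \alpha/3 & \text{if } i = 1 \\ 0 & \text{if } i > 1, \end{cases} $$

where $-1 \le \alpha \le 1$. However, we have

$$ \lambda = \int_0^1 c(x,x)\, dx = 1 + \frac{\alpha}{3}, $$

which is finite. Similarly, in the bivariate normal distribution given by (2.6.1b),

$$ \lambda = \frac{1}{1-\rho}, \quad 0 < \rho < 1. $$

In view of these examples, assumption (2.9.53) is not vacuous.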
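The value of $\lambda$ for the Morgenstern family is easy to confirm numerically. The sketch below integrates the diagonal of the density $c(x,y) = 1 + \alpha(1-2x)(1-2y)$ with a midpoint rule; the function names are hypothetical.

```python
def morgenstern_pdf(x, y, alpha):
    """Morgenstern density on the unit square."""
    return 1.0 + alpha * (1.0 - 2.0 * x) * (1.0 - 2.0 * y)

def lam_morgenstern(alpha, grid=1000):
    """Midpoint-rule approximation of the integral of c(x, x) over [0, 1]."""
    h = 1.0 / grid
    return h * sum(morgenstern_pdf((i + 0.5) * h, (i + 0.5) * h, alpha) for i in range(grid))
```

The result should agree with the closed form $1 + \alpha/3$ to within the quadrature error, for any $\alpha$ in [-1, 1].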

Bhattacharya and Ranga Rao (1976, pp. 189-192) give conditions that are equivalent to the assumption (2.9.54). We cite one here:

Let $G_L^{*m}$ denote the m-th convolution of the distribution of $L = T - U$, where $m \ge 1$. If there exists an integer m such that $G_L^{*m}$ has a bounded (almost everywhere) density, then the modulus of the characteristic function of L is integrable on $(-\infty, \infty)$ (that is, assumption (2.9.54) is valid) and vice versa.

Another sufficient condition for absolute integrability of $\Psi_L(\theta)$ is due to Bochner and Chandrasekar (1949). If there exists a bounded (almost everywhere) density $g_L(t)$ of $L = T - U$ and if its characteristic function $\Psi_L(\theta)$ is real and nonnegative, then

$$ \int_{-\infty}^{\infty} |\Psi_L(\theta)|\, d\theta < \infty. $$


We illustrate the use of this sufficient (but not necessary) condition when $(T, U)'$ has the Morgenstern PDF,

$$ c(x,y) = 1 + \alpha (1 - 2x)(1 - 2y). $$

Clearly, as $|\alpha| \le 1$, $|x| \le 1$, $|y| \le 1$, there exists a positive constant k such that

$$ c(x,y) \le k, \quad \forall\, (x,y) \in [0,1]^2. $$

Note that

$$ g_L(t) = \int_{y=0}^{1-t} c(t+y, y)\, dy, \quad \forall\, t \ge 0. $$

By the symmetry of c(x,y) in x and y, it can be shown that

$$ g_L(-t) = g_L(t), \quad \forall\, t \ge 0. $$

Now, using the bound k for c(x,y), and the fact that [-1,1] is the support of L, we get

$$ g_L(t) \le k \int_0^{1-t} dy \le 2k < \infty. $$

Hence, it follows that the PDF of L is (almost everywhere) bounded.


We now show that $\psi_L(\theta)$ is real and nonnegative $\forall\, \alpha \ge 0$:

$$\psi_L(\theta) = E\left(e^{i(T-U)\theta}\right) = I_1 + \alpha I_2 ,$$

where

$$I_1 = \int_0^1 \int_0^1 e^{i(x-y)\theta}\,dx\,dy = z_1 \bar{z}_1, \quad \text{with } z_1 = \int_0^1 e^{ix\theta}\,dx ,$$

$$I_2 = \int_0^1 \int_0^1 e^{i(x-y)\theta}(1-2x)(1-2y)\,dx\,dy = z_2 \bar{z}_2, \quad \text{with } z_2 = \int_0^1 e^{ix\theta}(1-2x)\,dx .$$

Hence, $\psi_L(\theta) = |z_1(\theta)|^2 + \alpha |z_2(\theta)|^2 \ge 0$ if $\alpha \ge 0$.


Invoking Bochner's sufficient condition, we get $\int_{-\infty}^{\infty} |\psi_L(\theta)|\,d\theta < \infty$, if $\alpha \ge 0$. However, for all $\alpha$,

$$\int_{-\infty}^{\infty} |\psi_L(\theta)|\,d\theta = \int_{-\infty}^{\infty} |z_1(\theta)|^2\,d\theta + \alpha \int_{-\infty}^{\infty} |z_2(\theta)|^2\,d\theta , \qquad (2.9.90)$$

so that the two integrals on the right-hand side must be finite when $\alpha > 0$. It follows that, even when $\alpha < 0$, $\int_{-\infty}^{\infty} |\psi_L(\theta)|\,d\theta < \infty$. We conclude that (2.9.54) is valid for any member of the Morgenstern family of densities. It may be remarked, in passing, that, in view of the generality of the conditions of Bhattacharya and Ranga Rao (1976) and Bochner and Chandrasekar (1949), (2.9.54) holds for many distributions of $(T, U)$.




Lastly,  we  discuss  the  validity  of  (2.9.55).  To  be  specific, 


when $d = 1$, one can get the bound


(W)(6)l  <  1  ~  P0<l-P0)  ♦  sin2(B/2)  V  B  <  6  <  w,  w  =  (*) 


where $p_0 = p_0(w) = 1 - x - y + 2C(x,y)$.


Therefore, 


| J  (n,B) |  <  }  J  n  e-(n-l)4sin26[P0(l-P0)]  dxdy 
4  0  0 


Thus, $J(n,\beta) \to 0$ as $n \to \infty$ if we show that $n M_{p_0(1-p_0)}(n) \to 0$ as $n \to \infty$, where $M_\eta(s)$ is the Laplace transform of $\eta$. A sufficient condition for this to happen is

$$P(P_0(1-P_0) \le t) = o(t), \quad \text{as } t \to 0 .$$


Let $\delta(t)$ and $1 - \delta(t)$ be the roots of the equation

$$p_0 (1 - p_0) = t . \qquad (2.9.91)$$

It suffices to show, as $t \to 0$,

$$P(P_0 \le \delta(t)) = o(t) \qquad (2.9.92)$$

and

$$P(P_0 \ge 1 - \delta(t)) = o(t) . \qquad (2.9.93)$$


If the components of $(T, U)$ are independent, then the PDF of $P_0$ can be shown to be

$$g_0(x) = -\ln(|1-2x|)\, I_{[0,1]}(x) ,$$


so that (2.9.92) and (2.9.93) are valid when $C(x,y) = C_0(x,y)$, where $C_0(x,y) = xy$. Also, if $C(x,y) \ge xy$, then $P_0(C) \ge P_0(C_0)$ so that

$$P(P_0(C) \le \delta(t)) \le P(P_0(C_0) \le \delta(t)) . \qquad (2.9.94)$$

Thus, using the exact calculations based on the independence case, it follows that

$$\forall\, C \ge xy, \quad P(P_0(C) \le \delta(t)) = o(t) .$$

At this time, we are optimistically speculating that, when $T$ and $U$ are dependent, (2.9.93) is also true. We are yet to demonstrate that the assumption (2.9.55) is not vacuous for any $d \ge 1$.

After we derived the proof of Theorem 2.9-2, we discussed the Poisson convergence problem with Professor Persi Diaconis, who communicated the problem to Professor Charles Stein. In his Neyman lecture at the IMS Annual (1984) meeting, Professor Stein outlined an alternative proof of the Poisson convergence using his well-known theorem concerning the approximation of probabilities. However, we have not yet seen a rigorous version of the proof.

3. MERGING FILES OF DATA ON SIMILAR INDIVIDUALS


Problems of statistical matching were discussed in Chapter 2, where we assumed that the two micro-data files being matched consisted of the same individuals. Moreover, the files did not have any common matching variables. In Chapter 1, practical and legal reasons were cited for these assumptions not to hold in certain situations. Suppose, then, we have two files of data that pertain to similar individuals. Allowing for some matching variables to be observed for each unit in the two files, we seek to merge the files so that inference problems relating to the variables not present in the same file can be addressed. This scenario was labeled Case III in Section 1. In this chapter, we shall first review the existing literature on Case III, and then briefly discuss some alternatives to matching in certain models in which the non-matching variables are conditionally independent given the values of the matching variables. Finally, we will present the results of a Monte Carlo study carried out to evaluate certain matching procedures relevant to Case III.




3.1 Kadane's Matching Strategies for Multivariate Normal Models


Distance-based matching strategies were introduced in Section 1.5. The choice of distance measures in the matching methodology can be motivated using a model where the unobserved triplet $W = (X,Y,Z)$ has a multivariate normal distribution. The set-up of the two files to be merged is as follows:

File 1 comprises a random sample of size $n_1$ on $(X,Z)$, while File 2 consists of a random sample of size $n_2$ on $(Y,Z)$. Furthermore, we expect very few or no records in the two files to correspond to the same individuals. Statistically, this means that, for all practical purposes, the two random samples are themselves independent. For this reason, we shall denote the sample data as follows.

(Base) File 1: $(X_i, Z_i)$, $i = 1, 2, \ldots, n_1$
(Supplementary) File 2: $(Y_j, Z_j)$, $j = n_1+1, \ldots, n_1+n_2$          (3.1.1)

Once finished, the matching process leads to more comprehensive synthetic files, namely

Synthetic File 1: $(X_i, Y_i^{*}, Z_i)$, $i = 1, 2, \ldots, n_1$
Synthetic File 2: $(X_j^{**}, Y_j, Z_j)$, $j = n_1+1, \ldots, n_1+n_2$          (3.1.2)

where $Y^{*}$ is an imputed value of $Y$ that comes from the original File 2 and $X^{**}$ is an imputed value of $X$ that is taken from the original File 1 by means of some matching strategy. We shall now review Kadane's (1978) development of the matching methodology for a multivariate normal model.


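Suppose that $W = (X,Y,Z)$ has a multivariate normal distribution with mean vector $\mu = (\mu_x, \mu_y, \mu_z)$ and variance-covariance matrix

$$\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} & \Sigma_{xz} \\ \Sigma_{yx} & \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zx} & \Sigma_{zy} & \Sigma_{zz} \end{pmatrix} . \qquad (3.1.3)$$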

The parameters $\Sigma_{xx}, \Sigma_{xz}, \Sigma_{yy}, \Sigma_{yz}, \Sigma_{zz}$ can all be estimated consistently using the marginal information on $(X,Z)$ and $(Y,Z)$ respectively in the two files. However, $\Sigma_{xy}$ is an unidentified parameter, because the joint likelihood of the data on $(X,Z)$ and $(Y,Z)$ is free of the matrix $\Sigma_{xy}$. In fact, in the domain in which $\Sigma_{xy}$ is such that the matrix

$$\begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}$$

is positive semidefinite, nothing is learned from the data about $\Sigma_{xy}$, except in a Bayesian framework, where $\Sigma_{xy}$, $\Sigma_{xz}$, $\Sigma_{yz}$ are, a priori, dependent. Even in this situation, the posterior distribution of $\Sigma_{xy}$ is updated only through $\Sigma_{xz}$ and $\Sigma_{yz}$.

Kadane's approach to merging File 1 and File 2 consists of the following steps:

(i) Start with an imputed value of $\Sigma_{xy}$ via some a priori distribution on the covariance matrix $\Sigma$.
(ii) Complete Files 1 and 2 by predicting the missing data, $X$ or $Y$, using the marginal information in the files.
(iii) Match these "completed" files based on a distance measure between records of the two files.
(iv) Estimate parameters such as

$$\gamma = \int g(w)\,dF(w) , \qquad (3.1.4)$$

using the synthetic file resulting from Step (iii).

Steps (ii) through (iv) are then repeated many times to find the sensitivity of the estimates to the imputed value of $\Sigma_{xy}$, and finally the results are weighted using the a priori distribution on $\Sigma$.

Some further details of the steps outlined above are as follows:

Suppose that an imputed value of $\Sigma_{xy}$ is available. Then we can assume that $\Sigma_{xy}$ is known and complete the two files by means of conditional expectations. Let $\Sigma_{ab.c}$, for any letters a, b and c, be given by

$$\Sigma_{ab.c} = \Sigma_{ab} - \Sigma_{ac}\Sigma_{cc}^{-1}\Sigma_{cb} .$$

Then the predicted value $\hat{Y}$, say, of a missing $Y$ in File 1 is given by

$$\hat{Y} = E(Y \mid X, Z) = \mu_y + \Sigma_{yx.z}\Sigma_{xx.z}^{-1}(X - \mu_x) + \Sigma_{yz.x}\Sigma_{zz.x}^{-1}(Z - \mu_z) . \qquad (3.1.5)$$

Similarly, the predicted value, $\hat{X}$, say, of a missing $X$ in File 2 is given by

$$\hat{X} = E(X \mid Y, Z) = \mu_x + \Sigma_{xy.z}\Sigma_{yy.z}^{-1}(Y - \mu_y) + \Sigma_{xz.y}\Sigma_{zz.y}^{-1}(Z - \mu_z) . \qquad (3.1.6)$$
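The completion step in (3.1.5) can be illustrated with a minimal sketch for the special case of scalar X, Y and Z. The function names `partial_cov` and `predict_y`, and the dictionary layout for the means and covariances, are assumptions made for this illustration and do not come from Kadane's paper.

```python
def partial_cov(s_ab, s_ac, s_cc, s_cb):
    """Scalar version of Sigma_{ab.c} = Sigma_ab - Sigma_ac * Sigma_cc^{-1} * Sigma_cb."""
    return s_ab - s_ac * s_cb / s_cc


def predict_y(x, z, mu, s):
    """Conditional mean E(Y | X = x, Z = z) as in (3.1.5), scalar case.

    mu: dict with keys 'x', 'y', 'z' (means).
    s:  dict with keys 'xx', 'yy', 'zz', 'xy', 'xz', 'yz' (covariances);
        s['xy'] plays the role of the imputed value of Sigma_xy.
    """
    s_yx_z = partial_cov(s["xy"], s["yz"], s["zz"], s["xz"])  # Sigma_{yx.z}
    s_xx_z = partial_cov(s["xx"], s["xz"], s["zz"], s["xz"])  # Sigma_{xx.z}
    s_yz_x = partial_cov(s["yz"], s["xy"], s["xx"], s["xz"])  # Sigma_{yz.x}
    s_zz_x = partial_cov(s["zz"], s["xz"], s["xx"], s["xz"])  # Sigma_{zz.x}
    return (mu["y"]
            + s_yx_z / s_xx_z * (x - mu["x"])
            + s_yz_x / s_zz_x * (z - mu["z"]))


# Illustrative values only: zero means, unit variances, all covariances 0.5.
mu = {"x": 0.0, "y": 0.0, "z": 0.0}
s = {"xx": 1.0, "yy": 1.0, "zz": 1.0, "xy": 0.5, "xz": 0.5, "yz": 0.5}
y_hat = predict_y(3.0, 0.0, mu, s)
```

With these illustrative values both regression coefficients equal 1/3, which agrees with the usual least-squares coefficients of Y on (X, Z). The prediction of a missing X in (3.1.6) follows by exchanging the roles of x and y.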


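Using (3.1.3), (3.1.5) and (3.1.6), it is now easy to show that $(X_i, \hat{Y}_i, Z_i)$ is multivariate normal with mean vector $(\mu_x, \mu_y, \mu_z)$ and variance-covariance matrix

$$Q_1 = \begin{pmatrix} \Sigma_{xx} & \Delta_1' & \Sigma_{xz} \\ \Delta_1 & \Delta_3 & \Delta_2' \\ \Sigma_{zx} & \Delta_2 & \Sigma_{zz} \end{pmatrix} , \qquad (3.1.7)$$

where

$$\Delta_1 = \Sigma_{yx.z}\Sigma_{xx.z}^{-1}\Sigma_{xx} + \Sigma_{yz.x}\Sigma_{zz.x}^{-1}\Sigma_{zx} ,$$

$$\Delta_2 = \Sigma_{zx}\Sigma_{xx.z}^{-1}\Sigma_{xy.z} + \Sigma_{zz}\Sigma_{zz.x}^{-1}\Sigma_{zy.x} ,$$

$$\Delta_3 = \Sigma_{yx.z}\Sigma_{xx.z}^{-1}\Sigma_{xx}\Sigma_{xx.z}^{-1}\Sigma_{xy.z} + \Sigma_{yz.x}\Sigma_{zz.x}^{-1}\Sigma_{zz}\Sigma_{zz.x}^{-1}\Sigma_{zy.x} + 2\Sigma_{yx.z}\Sigma_{xx.z}^{-1}\Sigma_{xz}\Sigma_{zz.x}^{-1}\Sigma_{zy.x} .$$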

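Also, the vectors $(\hat{X}_j, Y_j, Z_j)$, $j = n_1+1, \ldots, n_1+n_2$ have a common multivariate normal distribution with mean vector $(\mu_x, \mu_y, \mu_z)$ and variance-covariance matrix

$$Q_2 = \begin{pmatrix} \Delta_4 & \Delta_5' & \Delta_6' \\ \Delta_5 & \Sigma_{yy} & \Sigma_{yz} \\ \Delta_6 & \Sigma_{zy} & \Sigma_{zz} \end{pmatrix} , \qquad (3.1.8)$$

where

$$\Delta_4 = \Sigma_{xy.z}\Sigma_{yy.z}^{-1}\Sigma_{yy}\Sigma_{yy.z}^{-1}\Sigma_{yx.z} + \Sigma_{xz.y}\Sigma_{zz.y}^{-1}\Sigma_{zz}\Sigma_{zz.y}^{-1}\Sigma_{zx.y} + 2\Sigma_{xy.z}\Sigma_{yy.z}^{-1}\Sigma_{yz}\Sigma_{zz.y}^{-1}\Sigma_{zx.y} ,$$

$$\Delta_5 = \Sigma_{yy}\Sigma_{yy.z}^{-1}\Sigma_{yx.z} + \Sigma_{yz}\Sigma_{zz.y}^{-1}\Sigma_{zx.y} ,$$

$$\Delta_6 = \Sigma_{zy}\Sigma_{yy.z}^{-1}\Sigma_{yx.z} + \Sigma_{zz}\Sigma_{zz.y}^{-1}\Sigma_{zx.y} .$$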


Note that the distributions given by (3.1.7) and (3.1.8) are singular because the predicted values $\hat{Y}_i$ and $\hat{X}_j$ are linear functions of the other components of the random vectors $T_i = (X_i, \hat{Y}_i, Z_i)$ and $U_j = (\hat{X}_{j+n_1}, Y_{j+n_1}, Z_{j+n_1})$ respectively, where $i = 1, 2, \ldots, n_1$ and $j = 1, 2, \ldots, n_2$. In order to describe Kadane's procedures to match the completed File 1, namely $T_1, \ldots, T_{n_1}$, with the completed File 2, namely $U_1, \ldots, U_{n_2}$, let us first assume, for simplicity, that $n_1 = n_2 = n$. Starting with $n$ records in each file, we will compute the differences

$$T_i - U_j = \begin{pmatrix} X_i - \hat{X}_{j+n} \\ \hat{Y}_i - Y_{j+n} \\ Z_i - Z_{j+n} \end{pmatrix} , \quad 1 \le i, j \le n , \qquad (3.1.9)$$

in order to define a measure of dissimilarity between any pair of records, one each from the two completed files. Suppose first that there exists a vector of constants $\ell = (\ell_1, \ldots, \ell_n)'$, say, and $i$ and $j$ such that

$$P(\ell'(T_i - U_j) = 0) = 1 . \qquad (3.1.10)$$

In view of the independence of the random vectors $T_i$ and $U_j$, it is clear that (3.1.10) cannot hold. Consequently, any of the vectors $T_i - U_j$ is free of any linear relationship among its components. It follows from this fact and (3.1.7) to (3.1.9) that the differences $T_i - U_j$, $1 \le i, j \le n$, are identically distributed, each with a nonsingular multivariate normal distribution with mean 0 and variance-covariance matrix $Q_1 + Q_2$. For any positive definite matrix $A$, a dissimilarity measure between $T_i$ and $U_j$ can be defined by the quadratic form

$$d_{ij}(A) = (T_i - U_j)' A (T_i - U_j) . \qquad (3.1.11)$$

Also, $d_{ij}(A)$ will be referred to as the distance between the ith record of File 1 and the jth record of File 2. Various choices of $A$ in (3.1.11) provide different distance measures.

It may be recalled from Section 1.5 that a constrained matching of the two files is obtained by minimizing

$$C = \sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}\, a_{ij} \qquad (3.1.12)$$

subject to the conditions

$$\sum_{j=1}^{n} a_{ij} = 1, \quad \forall\, i = 1, 2, \ldots, n, \qquad (3.1.13)$$

$$\sum_{i=1}^{n} a_{ij} = 1, \quad \forall\, j = 1, 2, \ldots, n, \qquad (3.1.14)$$

and

$$a_{ij} = 0 \text{ or } 1, \quad \forall\, i \text{ and } j . \qquad (3.1.15)$$

If the $d_{ij}$'s in (3.1.12) are given by the $d_{ij}(A)$'s in (3.1.11) for some choice of $A$, then we obtain an optimal distance-based constrained match. Note that this type of matching of the two files amounts to solving a linear assignment problem. Sometimes, an optimal matching may be obtained by minimizing (3.1.12) without requiring that the conditions (3.1.13) and (3.1.14) hold. However, as reported in Rodgers (1984), unconstrained optimal matches do not provide good estimates of the distribution of $W = (X,Y,Z)$. We shall not discuss such "unconstrained matchings."
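To make the constrained match concrete, the following is a minimal sketch that evaluates the quadratic-form distance (3.1.11) and minimizes (3.1.12) under (3.1.13) to (3.1.15) by enumerating all permutations. The function names are hypothetical, the data are made up, and the brute-force search is only feasible for very small n; a practical implementation would use a dedicated linear assignment solver instead.

```python
from itertools import permutations


def quad_form(diff, A):
    """Return diff' A diff for a difference vector and a square matrix A (nested lists)."""
    p = len(diff)
    return sum(diff[r] * A[r][c] * diff[c] for r in range(p) for c in range(p))


def constrained_match(T, U, A):
    """Brute-force constrained match of completed records T (File 1) and U (File 2).

    Returns (perm, cost), where record i of File 1 is matched to record
    perm[i] of File 2 and cost is the minimized value of (3.1.12).
    A one-to-one assignment (a permutation) satisfies (3.1.13)-(3.1.15).
    """
    n = len(T)
    d = [[quad_form([t - u for t, u in zip(T[i], U[j])], A) for j in range(n)]
         for i in range(n)]
    best = min(permutations(range(n)),
               key=lambda perm: sum(d[i][perm[i]] for i in range(n)))
    return list(best), sum(d[i][best[i]] for i in range(n))


# Made-up two-component records and A = identity, for illustration only.
T = [(0, 0), (5, 5), (10, 10)]
U = [(10, 9), (0, 1), (5, 4)]
A = [[1, 0], [0, 1]]
perm, cost = constrained_match(T, U, A)
```

Here each record of File 1 is paired with its nearest counterpart in File 2, and every record of File 2 is used exactly once.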

It is important to note that the aforementioned optimization problem needs to be solved for each realization of the random variables involved. Suppose then that $T_i$ and $U_j$ have been matched in a given problem. Then it might be natural to take $(X_i, Y_j, Z_i)$ and $(X_i, Y_j, Z_j)$ as simulations of the underlying distribution. Now, the parameter $\gamma$ in (3.1.4) can be estimated using one of the following synthetic samples:

Synthetic File 1: $(X_i, Y_i^{*}, Z_i)$, $i = 1, 2, \ldots, n$,          (3.1.16)

Synthetic File 2: $(X_j^{**}, Y_j, Z_j)$, $j = n+1, \ldots, 2n$,          (3.1.17)

where $Y_i^{*}$ and $X_j^{**}$ are values given by the matching procedure.

Kadane has suggested that matchings based on a fixed $A$ in (3.1.11) and the consequent inferences based on synthetic files such as (3.1.16) or (3.1.17) must be repeated many times and the results must be averaged in some sensible way in order to explore the sensitivity of our findings to the value of $\Sigma_{xy}$ we started with. We shall not pursue such issues as the actual choice of a prior on $\Sigma$ and the aforementioned sensitivity studies of inferences based on synthetic data. However, we shall now discuss Kadane's choices of the matrix $A$, which will be used in our Monte Carlo study of Section 3.3.

Kadane has advocated two choices for the matrix $A$ in the definition of the distance measure $d_{ij}$, which is given by (3.1.11):

(i) $A = (Q_1 + Q_2)^{-1}$,          (3.1.18)

where $Q_1$ and $Q_2$ are the matrices in (3.1.7) and (3.1.8); this $A$ leads to the so-called Mahalanobis distance between the records of the two files, and


(ii) $A = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \Sigma_{zz}^{-1} \end{pmatrix}$.          (3.1.19)


In general, the relative benefits of these two distance measures are an open question, although the empirical studies of Barr et al. (1982) and other investigators reported in Rodgers (1984) indicate that the Mahalanobis distance is worse than the distance provided by (3.1.19) in the sense of distorting the bivariate and multivariate relationships among the variables X, Y and Z. In view of this, we shall follow Kadane (1978) in calling the measure induced by (3.1.19) the "bias-avoiding distance function." The special case of (3.1.19) when Z has only one component will be discussed in the next subsection.

3.1.1 Isotonic Matching Strategy


We shall evaluate, in Section 3.3, Kadane's matching strategies in the simple case when the triple $W = (X,Y,Z)$ has a trivariate normal distribution. In order to facilitate such evaluations, we now show that, in the special case of a scalar Z, the matching strategy based on (3.1.19) can be implemented without using any algorithm to minimize distances.

Assuming that Z is scalar and using (3.1.19) in the objective function given by (3.1.12), $C$ is equivalent to

$$C = \sum_{i=1}^{n}\sum_{j=1}^{n} (Z_{1i} - Z_{2j})^2\, a_{ij} , \qquad (3.1.20)$$

where $Z_{1i}$ denotes the Z-value of the ith record of File 1 and $Z_{2j}$ that of the jth record of File 2. In a constrained match, the $a_{ij}$'s are subject to the conditions (3.1.13) to (3.1.15). Thus, (3.1.20) further simplifies to

$$C = \sum_{i=1}^{n} Z_{1i}^2 + \sum_{j=1}^{n} Z_{2j}^2 - 2\sum_{i=1}^{n}\sum_{j=1}^{n} Z_{1i} Z_{2j}\, a_{ij} .$$

Hence, the minimization of distances reduces to maximizing

$$C' = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}\, Z_{1i} Z_{2j} \qquad (3.1.21)$$

subject to the conditions (3.1.13) to (3.1.15) on the $a_{ij}$'s.


DeGroot and Goel (1976) show that, given the numbers $z_{1i}$'s and $z_{2j}$'s, the constrained maximization of $C'$ is equivalent to maximizing $\sum_{i=1}^{n} z_{1i} z_{2\phi(i)}$ over all permutations $\phi$ of the integers $1, 2, \ldots, n$. However, this latter extremal problem was encountered in Section 2.4 when we derived the M.L.P. for certain bivariate matching problems. It follows that, with regard to Kadane's distance measure given by (3.1.19), where Z is scalar, the optimal matching strategy is to order the Z-values in the two files separately and then match the ith largest Z in File 1 with the ith largest Z in File 2. This explicit solution means that, if Kadane's matrix in equation (3.1.19) is used to minimize distances between records of the two files, then the synthetic File 1 is obtained by matching the X-concomitant of the ith order statistic among the Z's in File 1 with the Y-concomitant of the ith order statistic among the Z's in File 2.

We shall refer to this strategy as isotonic matching of the two files because the matching procedure is determined by the order statistics of the Z's in File 1 and the order statistics of the Z's in File 2.
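The isotonic strategy can be sketched in a few lines, together with a brute-force check that it attains the minimum of (3.1.20) on a small made-up example. The function names and the sample values are assumptions made for this illustration; ties among the Z-values are broken by the original record order, which the text does not specify.

```python
from itertools import permutations


def isotonic_permutation(z1, z2):
    """Return perm such that record i of File 1 is matched to record perm[i] of File 2,
    pairing the k-th smallest Z in File 1 with the k-th smallest Z in File 2."""
    order1 = sorted(range(len(z1)), key=lambda i: z1[i])
    order2 = sorted(range(len(z2)), key=lambda j: z2[j])
    perm = [None] * len(z1)
    for i, j in zip(order1, order2):
        perm[i] = j
    return perm


def isotonic_match(file1, file2):
    """file1: list of (x, z); file2: list of (y, z). Returns synthetic File 1 as (x, y*, z)."""
    perm = isotonic_permutation([z for _, z in file1], [z for _, z in file2])
    return [(x, file2[perm[i]][0], z) for i, (x, z) in enumerate(file1)]


def sum_sq(z1, z2, perm):
    """Objective (3.1.20) evaluated at a given one-to-one match."""
    return sum((z1[i] - z2[perm[i]]) ** 2 for i in range(len(z1)))


# Made-up data for illustration.
file1 = [(1.0, 0.3), (2.0, -1.2), (3.0, 0.9)]     # (x, z)
file2 = [(10.0, -1.0), (20.0, 1.1), (30.0, 0.2)]  # (y, z)
synthetic1 = isotonic_match(file1, file2)

z1 = [z for _, z in file1]
z2 = [z for _, z in file2]
best_cost = min(sum_sq(z1, z2, p) for p in permutations(range(3)))
```

Only two sorts are needed, so the cost is of order n log n rather than that of a general assignment algorithm.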

3.1.2  Sims'  Matching  Strategy 

In the preceding subsection, it was shown that one of Kadane's matching strategies can be simplified to the point of not using any optimization algorithm in the matching procedure. Such simplification is clearly not possible when the triple (X,Y,Z) has a multidimensional Z. The whole idea of generating very large synthetic data sets by actually minimizing a sum of distances over all potential matches seems computationally profligate. One possible alternative to distance-based strategies, which was suggested by Sims (1978), will now be outlined.


Sims has stressed the importance of exploiting the local sparseness or denseness of the sample data on the matching variables Z. A dense region of the Z space is one within which we expect that the distributions of X and Y given Z change little. It is, at the same time, a region within which we have many observations. Sims has suggested that, within a dense region, any arbitrary matching procedure will produce results that do not distort the joint distribution of X, Y and Z. Regions which are not dense have few observations and, within them, statistical matching becomes difficult. Sims felt that in a sparse region, statistical matchings will almost certainly distort the joint distribution of X, Y and Z. He suggested that, in such a region, we should either not match at all or go beyond matching to more elaborate methods of generating synthetic data. However, Sims did not spell out any specific alternative to matching within sparse Z regions.

In our Monte Carlo study for comparing Kadane's strategies with Sims', which will be presented in Section 3.3, we created ten bins in the Z space, namely $(-\infty,-1.00]$, $(-1.00,-0.75]$, $(-0.75,-0.50]$, $(-0.50,-0.25]$, $(-0.25,0.00]$, $(0.00,0.25]$, $(0.25,0.50]$, $(0.50,0.75]$, $(0.75,1.00]$, $(1.00,+\infty)$. The conditional mean of X or Y, given Z, did not change much inside the eight bins which were between -1.00 and 1.00. Hence, these latter bins were considered dense bins and the two bins in the left and right tail of the distribution of Z were considered sparse bins. Within each dense bin, we randomly matched records of the two files, whereas the isotonic matching strategy of Subsection 3.1.1 was used in the sparse bins.
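The binning of the Z space described above can be sketched as follows. The names `EDGES`, `bin_index` and `is_dense` are hypothetical, and the sketch assumes that every bin is closed on the right, as in the list of intervals; it only classifies Z-values and does not reproduce the matching carried out inside the bins.

```python
from bisect import bisect_left

# Interior bin boundaries; bin 0 is (-inf, -1.00] and bin 9 is (1.00, +inf).
EDGES = [-1.00, -0.75, -0.50, -0.25, 0.00, 0.25, 0.50, 0.75, 1.00]


def bin_index(z):
    """Index (0..9) of the right-closed bin containing z."""
    return bisect_left(EDGES, z)


def is_dense(z):
    """True for the eight bins between -1.00 and 1.00, False for the two tail bins."""
    return 1 <= bin_index(z) <= 8


def group_by_bin(z_values):
    """Map bin index -> list of record positions whose Z falls in that bin."""
    groups = {}
    for pos, z in enumerate(z_values):
        groups.setdefault(bin_index(z), []).append(pos)
    return groups
```

Records of the two files that fall in the same dense bin would then be paired at random, while the records in the two tail bins would be paired by the isotonic rule of Subsection 3.1.1.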
3.2 Alternatives to Statistical Matching Under Conditional Independence

Several criticisms of the matching methodology were mentioned in Section 1.6. It was observed that the formation of packets on the basis of matching variables Z and the merging of records within each packet imply that the non-matching variables X and Y are conditionally independent given the values of Z. Following A. P. Dawid (1979), we shall use the notation $X \perp\!\!\!\perp Y \mid Z$ to denote this conditional independence among the variables X, Y and Z.


Consider  the  situation  in  which  we  match  the  fragmentary  data 


3 

provided 

1.2  that 

y 

*  • 

that  the 

C 

File  2. 

test  the 

more, Sims (1978) has observed that matching itself for the purpose of, among others, estimating $\gamma$ in (3.1.4) is unnecessary. He pointed out that, when $X \perp\!\!\!\perp Y \mid Z$ holds, one can write


S 


where $F_{xz}(\cdot)$ is the marginal (with regard to W) CDF of X and Z and the other terms on the right-hand side of (3.2.1) are analogously defined marginal distribution functions. The two separate samples in (3.1.1) are adequate to estimate all the terms on the right-hand side of (3.2.1) by any of a number of statistical methods. In this section, we will discuss some alternatives to matching. With emphasis on estimating the covariances or correlations between X and Y, we shall first review a histogram-type alternative which was suggested by Sims (1978).

Suppose that we form a grid in the W space and estimate the joint density of W by first counting the number of sample points in each cell of the grid. Let i index X-categories, j index Y-categories and k index Z-categories. Let $n_{ijk}$ be the number of sample points in the $(i,j,k)$th cell and use the dot notation to define counts of sample points with regard to marginal distributions. Thus, we have

$n_{i.k}$ = the number of sample points with X in the ith category and Z in the kth category,

$n_{.jk}$ = the number of sample points with Y in the jth category and Z in the kth category,

and

$n_{..k}$ = the number of sample points with Z in the kth category.

Clearly,


.k  ^  ni.k  ^  n.  jk 


and the data in the two files given by (3.1.1) can be used to compute $n_{i.k}$, $n_{.jk}$ and $n_{..k}$ for all possible values of i, j and k. Thus, $n_{i.k}$ is obtained from File 1, $n_{.jk}$ from File 2 and $n_{..k}$ from the two files together. Finally, for a known function, $g(\cdot)$, say, let $g(w_{ijk})$ denote the value of $g$ computed at the center, $w_{ijk}$, of the $(i,j,k)$th cell of the grid that we started with. Sims has suggested that we could estimate $\gamma$ in (3.1.4) by the statistic

$$\hat{\gamma} = \sum_{i,j,k} g(w_{ijk})\, \frac{n_{i.k}\, n_{.jk}}{n_{..k}} . \qquad (3.2.2)$$


With regard to $\hat{\gamma}$ in (3.2.2), theoretical properties such as the asymptotic distribution of $\hat{\gamma}$ (as the sample size tends to $\infty$) are unknown at the present time. Also, practical problems such as the choice of the W-grid and the cells thereof, which would keep the number of terms in the sum (3.2.2) computationally reasonable, have not been studied yet.

Sims (1978) stated that a procedure like the one leading to $\hat{\gamma}$ in (3.2.2), which takes into account the implicit assumption of conditional independence of the matching methodology, had the following advantages over matching to create a synthetic file such as (3.1.16):

(a) the procedure lends itself to computation of standard errors indicating the reliability of computations based on it,

(b) the procedure can be connected to the large statistical literature on estimating density functions and multi-dimensional contingency tables, and

(c) it is likely to provide more accurate results than matching.

Given the lack of work on the statistical properties of the alternatives to matching, we can agree with the advantages (a) and (b), but regard (c) as an undemonstrated speculation. We shall not discuss $\hat{\gamma}$ in (3.2.2) any further. Nor shall we elaborate on the merits and demerits of alternatives to matching and synthetic-data-based procedures. Nevertheless, in the next subsection, we shall derive the estimators of parameters for conditionally independent normal models without matching the files in (3.1.1).

In this section, we shall find the maximum likelihood estimator of, among others, the covariances among the variables in the vectors X and Y, without matching the files (3.1.1) but assuming that $X \perp\!\!\!\perp Y \mid Z$. The maximum likelihood estimation of parameters in multivariate normal models based on various patterns of missing data has been discussed in the literature. See, for example, Eaton and Kariya (1983), Kariya et al. (1983), Anderson (1984) and Srivastava and Khatri (1979). However, the pattern of data given by the set-up (3.1.1) does not seem to have been examined. Note first that, under conditional independence, the density of W can be written as


$$f_W(w; \theta) = f_1(z; \theta)\, f_2(x \mid z; \theta)\, f_3(y \mid z; \theta) , \qquad (3.2.3)$$

where $\theta = (\mu_x, \mu_y, \mu_z, \Sigma_{xx}, \Sigma_{xz}, \Sigma_{yy}, \Sigma_{yz}, \Sigma_{zz})$ and $f_W(w)$ is the joint density of W given by

$$f_W(w) = (2\pi)^{-(p_1+p_2+p_3)/2}\, |\Sigma|^{-1/2}\, \mathrm{etr}\left[-\tfrac{1}{2}\Sigma^{-1}(w-\mu)(w-\mu)'\right] , \qquad (3.2.5)$$

etr being the exponential of the trace of a matrix. Also, $f_1(\cdot)$ is the marginal density function of Z, and $f_2(\cdot)$ and $f_3(\cdot)$ are respectively the conditional densities of X and Y, given $Z = z$. It is well known (Anderson, 1984, pp. 33 and 37) that $f_1$, $f_2$ and $f_3$ also correspond to certain multivariate normal densities like (3.2.5). Using the joint normality of X, Y and Z, it is easy to verify that (3.2.3) holds iff

$$\Sigma_{xy} = \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zy} . \qquad (3.2.6)$$


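It follows from (3.2.3) that the likelihood of the observed data in the two files given by (3.1.1) is

$$L(\theta) = L_1(\theta)\, L_2(\theta)\, L_3(\theta) , \qquad (3.2.7)$$

where

$$L_1(\theta) = \prod_{i=1}^{n_1+n_2} f_1(z_i; \theta) , \qquad (3.2.8)$$

$$L_2(\theta) = \prod_{i=1}^{n_1} f_2(x_i \mid z_i; \theta) , \qquad (3.2.9)$$

$$L_3(\theta) = \prod_{i=n_1+1}^{n_1+n_2} f_3(y_i \mid z_i; \theta) . \qquad (3.2.10)$$

Taking natural logarithms of both sides of the equation (3.2.7), we obtain

$$\ell(\theta) = \sum_{a=1}^{3} \ell_a(\theta) , \qquad (3.2.11)$$

where $\ell_a(\theta) = \log_e(L_a(\theta))$, $\forall\, a = 1, 2, 3$.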

Let $\bar{z}$ and $s_z$ denote respectively the mean and the matrix of corrected sums of squares and products of the data $z_1, \ldots, z_{n_1+n_2}$. That is,

$$\bar{z} = \frac{1}{n_1+n_2}\sum_{i=1}^{n_1+n_2} z_i , \qquad s_z = \sum_{i=1}^{n_1+n_2} (z_i - \bar{z})(z_i - \bar{z})' . \qquad (3.2.12)$$

Similarly, let $\bar{z}_1$ ($\bar{z}_2$) and $s_1$ ($s_2$) be the mean and the matrix of corrected sums of squares and products of the data $z_1, \ldots, z_{n_1}$ ($z_{n_1+1}, \ldots, z_{n_1+n_2}$). Let, for any lower case a, b and c, and any vector $z$,

$$\mu_{a.b}(z) = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(z - \mu_b) , \qquad \Sigma_{ab.c} = \Sigma_{ab} - \Sigma_{ac}\Sigma_{cc}^{-1}\Sigma_{cb} . \qquad (3.2.13)$$
(3.2.13) 


Then using the notations in (3.2.12) and (3.2.13), the equations (3.2.5), (3.2.7) to (3.2.10) and Theorem 2.5.1 of Anderson (1984) (for the expressions defining $f_2$ and $f_3$), we obtain

$$\ell_1(\theta) = -\frac{n_1+n_2}{2}\log|\Sigma_{zz}| + \mathrm{tr}\left\{-\tfrac{1}{2}\Sigma_{zz}^{-1}\left[s_z + (n_1+n_2)(\bar{z} - \mu_z)(\bar{z} - \mu_z)'\right]\right\} , \qquad (3.2.14)$$

$$\ell_2(\theta) = -\frac{n_1}{2}\log|\Sigma_{xx.z}| + \mathrm{tr}\left\{-\tfrac{1}{2}\Sigma_{xx.z}^{-1}\left[\sum_{i=1}^{n_1}(x_i - \mu_{x.z}(z_i))(x_i - \mu_{x.z}(z_i))'\right]\right\} , \qquad (3.2.15)$$

$$\ell_3(\theta) = -\frac{n_2}{2}\log|\Sigma_{yy.z}| + \mathrm{tr}\left\{-\tfrac{1}{2}\Sigma_{yy.z}^{-1}\left[\sum_{j=n_1+1}^{n_1+n_2}(y_j - \mu_{y.z}(z_j))(y_j - \mu_{y.z}(z_j))'\right]\right\} . \qquad (3.2.16)$$

Note that in (3.2.14) to (3.2.16), certain constant terms have been omitted.


It is clear from (3.2.7) and (3.2.11) that the M.L.E. of $\theta$ is obtained by maximizing $\ell_a(\theta)$ over $\theta$ for each $a = 1, 2, 3$ separately. Moreover, this maximization is easier if we reparametrize the distribution of W by means of

$$\eta = (\mu_z, \Sigma_{zz}, \nu_{xz}, \nu_{yz}, \Sigma_{xx.z}, \Sigma_{yy.z}, B_{xz}, B_{yz}) , \qquad (3.2.17)$$

where, apart from the notations that we have already introduced, we have, for any letters a and b,

$$B_{ab} = \Sigma_{ab}\Sigma_{bb}^{-1} \quad \text{and} \quad \nu_{ab} = \mu_a - B_{ab}\,\mu_b . \qquad (3.2.18)$$

It can be easily shown that there is a one-to-one correspondence between $\theta$ and $\eta$. Consequently, if we rewrite the $\ell_a(\theta)$'s in terms of $\eta$, then maximizing $\ell_a(\theta)$ over $\theta$ is equivalent to maximizing $\ell_a(\eta)$ over $\eta$, for each $a = 1, 2, 3$. The advantage of the transformation to the $\eta$ space is that the $\ell_a(\eta)$'s are functions of disjoint portions of $\eta$. In fact, $\ell_1(\eta)$ is the same as $\ell_1(\theta)$, whereas it follows from (3.2.15) to (3.2.18) that

$$\ell_2(\eta) = -\frac{n_1}{2}\log|\Sigma_{xx.z}| + \mathrm{tr}\left\{-\tfrac{1}{2}\Sigma_{xx.z}^{-1}\left[\sum_{i=1}^{n_1}(x_i - \nu_{xz} - B_{xz} z_i)(x_i - \nu_{xz} - B_{xz} z_i)'\right]\right\} , \qquad (3.2.19)$$

$$\ell_3(\eta) = -\frac{n_2}{2}\log|\Sigma_{yy.z}| + \mathrm{tr}\left\{-\tfrac{1}{2}\Sigma_{yy.z}^{-1}\left[\sum_{j=n_1+1}^{n_1+n_2}(y_j - \nu_{yz} - B_{yz} z_j)(y_j - \nu_{yz} - B_{yz} z_j)'\right]\right\} . \qquad (3.2.20)$$

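In view of Theorem 8.2.1 of Anderson (1984), it can be easily shown using (3.2.14), (3.2.19) and (3.2.20) that the M.L.E. of $\eta$ is given by

$$\hat{\mu}_z = \bar{z} , \qquad \hat{\Sigma}_{zz} = \frac{s_z}{n_1+n_2} ,$$

$$\hat{B}_{xz} = \left[\sum_{i=1}^{n_1}(x_i - \bar{x})(z_i - \bar{z}_1)'\right] s_1^{-1} , \qquad \hat{\nu}_{xz} = \bar{x} - \hat{B}_{xz}\,\bar{z}_1 ,$$

$$\hat{B}_{yz} = \left[\sum_{j=n_1+1}^{n_1+n_2}(y_j - \bar{y})(z_j - \bar{z}_2)'\right] s_2^{-1} , \qquad \hat{\nu}_{yz} = \bar{y} - \hat{B}_{yz}\,\bar{z}_2 , \qquad (3.2.21)$$

$$\hat{\Sigma}_{xx.z} = \frac{1}{n_1}\sum_{i=1}^{n_1}(x_i - \hat{\nu}_{xz} - \hat{B}_{xz} z_i)(x_i - \hat{\nu}_{xz} - \hat{B}_{xz} z_i)' ,$$

$$\hat{\Sigma}_{yy.z} = \frac{1}{n_2}\sum_{j=n_1+1}^{n_1+n_2}(y_j - \hat{\nu}_{yz} - \hat{B}_{yz} z_j)(y_j - \hat{\nu}_{yz} - \hat{B}_{yz} z_j)' .$$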


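Using these estimators and the relationships between $\theta$ and $\eta$, we obtain the M.L.E. of $\theta$ by means of the following equations:

$$\hat{\mu}_x = \hat{\nu}_{xz} + \hat{B}_{xz}\,\hat{\mu}_z ,$$

$$\hat{\mu}_y = \hat{\nu}_{yz} + \hat{B}_{yz}\,\hat{\mu}_z ,$$

$$\hat{\mu}_z = \bar{z} ,$$

$$\hat{\Sigma}_{xx} = \hat{B}_{xz}\,\hat{\Sigma}_{zz}\,\hat{B}_{xz}' + \hat{\Sigma}_{xx.z} . \qquad (3.2.22)$$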


It follows from the above discussion that if we can justify the
assumption that $X \perp\!\!\!\perp Y \mid Z$, then we can avoid
matching the files in (3.1.1) and estimate, among other parameters,
$\Sigma_{xy}$ by means of the equations in (3.2.22). Unfortunately, the
two data files contain no information regarding the appropriateness of
this assumption, and prior information from other sources must be
considered. The point here is that, if the matching methodology is
based on assumptions like $X \perp\!\!\!\perp Y \mid Z$, then we must
look for alternatives to matching whose statistical properties are
known. Such alternatives are useful especially because very little is
known about the reliability of synthetic data files.
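The no-matching estimate described above can be sketched in a few lines for the simplest case. This is a minimal illustration, assuming scalar X, Y and Z so that the matrix products reduce to ordinary multiplication, and assuming that the pieces combine as $\hat\Sigma_{xy} = \hat B_{xz}\hat\Sigma_{zz}\hat B_{yz}'$ (that is, $\hat\Sigma_{xz} = \hat B_{xz}\hat\Sigma_{zz}$). The function name `no_matching_cov_xy` and the list-of-pairs file layout are hypothetical and not part of the thesis.

```python
from statistics import fmean


def no_matching_cov_xy(file1, file2):
    """Estimate cov(X, Y) without matching.

    file1 is a list of (x, z) pairs, file2 a list of (y, z) pairs.
    Assumes scalar variables and X, Y conditionally independent given Z.
    """
    x = [xi for xi, _ in file1]
    z1 = [zi for _, zi in file1]
    y = [yi for yi, _ in file2]
    z2 = [zi for _, zi in file2]

    def slope(u, z):
        # Least-squares slope of u on z, computed within a single file.
        ub, zb = fmean(u), fmean(z)
        num = sum((ui - ub) * (zi - zb) for ui, zi in zip(u, z))
        den = sum((zi - zb) ** 2 for zi in z)
        return num / den

    b_xz = slope(x, z1)  # uses File 1 only
    b_yz = slope(y, z2)  # uses File 2 only

    # Z is observed in both files, so its variance is pooled (divisor n1 + n2).
    z_all = z1 + z2
    zb = fmean(z_all)
    s_zz = sum((zi - zb) ** 2 for zi in z_all) / len(z_all)

    return b_xz * s_zz * b_yz
```

No record of File 1 is ever paired with a record of File 2 here; the two files contribute only through their own regression slopes and the pooled variance of Z.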

It is important to note that (3.2.6) is a necessary condition
even if W is not normal, provided only that
$X \perp\!\!\!\perp Y \mid Z$ holds and that the appropriate moments of
the distribution of W exist. Hence, we can use the estimator
$\hat\Sigma_{xy}$ in (3.2.22) even for non-normal populations. We now
show that $\hat\Sigma_{xy}$ is consistent for $\Sigma_{xy}$ without
assuming that W has a multivariate normal distribution.

Theorem 3.2.1 Suppose the joint distribution of W is such that its
second order moments exist and that the dispersion matrix, $\Sigma$, of
W is partitioned as in (3.1.3). If $X \perp\!\!\!\perp Y \mid Z$, then
$\hat\Sigma_{xy}$, given by (3.2.22), is strongly consistent for
$\Sigma_{xy}$.

Proof: We first note that $\hat\Sigma_{xz}$ and $\hat\Sigma_{zy}$ are
stochastically independent because they are functions of the
independent data in File 1 and File 2 respectively. However,
$\hat\Sigma_{zz}$ involves the $Z_i$'s in both files, so that the
elements of the vector

$$(\hat\Sigma_{xz}, \hat\Sigma_{zz}, \hat\Sigma_{zy})$$

(3.2.23)

are dependent. The almost sure convergence of the vector in (3.2.23)
will follow from the almost sure convergence of $\hat\Sigma_{xz}$,
$\hat\Sigma_{zz}$ and $\hat\Sigma_{zy}$ individually (cf. Serfling,
1980, p. 52). In view of the similarities of the proofs of the
convergence of these matrices, we shall only show that, as
$n_a \to \infty$, $a = 1,2$,

$$\hat\Sigma_{zz} \xrightarrow{a.s.} \Sigma_{zz}$$

(3.2.24)

We obtain from (3.2.21),

$$\hat\Sigma_{zz} = \frac{1}{n_1+n_2}\sum_{i=1}^{n_1+n_2} Z_i Z_i' - \bar Z \bar Z'$$

(3.2.25)

Recalling our assumption that the files in (3.1.1) are independent
random samples and that the vector Z has a finite dispersion matrix,
it readily follows that the Strong Law of Large Numbers (cf.
Serfling, p. 27) applies to the independent sequences $\{Z_i\}$ and
$\{Z_i Z_i'\}$. Hence, we obtain, as $n_a \to \infty$,

$$\frac{1}{n_1+n_2}\sum_{i=1}^{n_1+n_2} Z_i Z_i' \xrightarrow{a.s.} E(ZZ')$$

(3.2.26)

and

$$\bar Z \xrightarrow{a.s.} E(Z)$$

(3.2.27)

It follows from (3.2.25) to (3.2.27) that (3.2.24) holds. We conclude
from our remarks earlier in this proof that, as $n_a \to \infty$,

$$(\hat\Sigma_{xz}, \hat\Sigma_{zz}, \hat\Sigma_{zy}) \xrightarrow{a.s.} (\Sigma_{xz}, \Sigma_{zz}, \Sigma_{zy})$$

(3.2.28)

Let us now observe that

$$\hat\Sigma_{xy} = \hat\Sigma_{xz}\hat\Sigma_{zz}^{-1}\hat\Sigma_{zy}$$

is a continuous function of the random variables in the vector
(3.2.23). Hence, the strong consistency of $\hat\Sigma_{xy}$ follows
from (3.2.28).
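The consistency claim can be checked numerically on a deliberately non-normal population. The sketch below is only an illustration under stated assumptions: scalar variables, an exponential Z with unit variance, uniform errors, and the scalar form $\hat B_{xz}\hat\Sigma_{zz}\hat B_{yz}$ of the estimator. The coefficients `a` and `b`, the seed and the sample sizes are arbitrary choices, not values from the thesis.

```python
import random
from statistics import fmean


def slope(u, z):
    """Least-squares slope of u on z."""
    ub, zb = fmean(u), fmean(z)
    num = sum((p - ub) * (q - zb) for p, q in zip(u, z))
    return num / sum((q - zb) ** 2 for q in z)


def estimate_cov_xy(n, a=0.8, b=0.5, seed=0):
    """No-matching estimate of cov(X, Y) from two independent files of size n."""
    rng = random.Random(seed)
    # Z is exponential (variance 1) and the errors are uniform: nothing is normal.
    z1 = [rng.expovariate(1.0) for _ in range(n)]
    z2 = [rng.expovariate(1.0) for _ in range(n)]
    x = [a * z + rng.uniform(-1, 1) for z in z1]  # File 1 holds (x, z1)
    y = [b * z + rng.uniform(-1, 1) for z in z2]  # File 2 holds (y, z2)
    z_all = z1 + z2
    zb = fmean(z_all)
    s_zz = sum((z - zb) ** 2 for z in z_all) / len(z_all)
    return slope(x, z1) * s_zz * slope(y, z2)


TRUE_COV = 0.8 * 0.5 * 1.0  # a * b * var(Z) under conditional independence

for n in (100, 1000, 10000):
    print(n, round(estimate_cov_xy(n), 4), "target", TRUE_COV)
```

Because X and Y depend on each other only through Z in this construction, the population covariance is a·b·var(Z), and the printed estimates should settle near that target as n grows.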


3.3 An Empirical Evaluation of
Certain Matching Strategies

Several distance-based matching strategies for creating
synthetic data have been discussed in Section 3.1. Specifically, two
strategies due to Kadane (1978) and a strategy which was proposed by
Sims (1978) were mentioned. In this section, we shall evaluate these
three strategies, individually as well as in relative terms, in the
special case where W = (X,Y,Z), the unobservable vector, has a
trivariate normal distribution. Before we discuss the Monte Carlo study
of the aforementioned strategies, we shall review some of the earlier
simulation studies of statistical matching procedures which have a
certain bearing on our study. A more comprehensive review of
evaluations of statistical matching procedures can be found in Rodgers
(1984).

Barr et al. (1982) used, among others, a statistical model in
which a vector $W = (X, Y, Z_1, Z_2)$ had a four-dimensional normal
distribution with zero means, unit variances and various levels of
covariances among the four variables. Altogether, these investigators
generated 100 pairs of independent files, namely File 1 comprising 200
observations on $(X, Z_1, Z_2)$ and File 2 consisting of 200
observations on $(Y, Z_1, Z_2)$, for each of 12 populations, where the
populations differed with respect to the covariances of the
variables. Then, for each such pair of files, six statistical
matches were performed, namely three constrained matches and three
unconstrained matches, one for each of three distance functions. The
first distance function was a weighted sum of the absolute differences
of the two Z variables between records of the two files, and the last
two were the Mahalanobis distance and the "bias avoiding" distance,
which were discussed in Section 3.1. A summary of the findings of Barr
et al. is as follows.

All three distance measures provided accurate estimates of the
variance of the Y variable when the constrained matching procedure
was used. They also found that all three unconstrained matching
procedures produced Y distributions that had means which were
significantly different from the corresponding population values.
The covariances of Y with $Z_1$ and $Z_2$, which were estimated only
for constrained matches, tended to be underestimated. With respect
to the most important question in the context of merging files,
namely the estimation of relationships between the X and Y variables,
it was reported that, if the conditional independence assumption was
invalid, all statistical matching procedures provided estimates of
the X-Y covariance that were extremely poor. On the other hand, for
the cases in which the conditional independence assumption was valid,
all six procedures provided estimates of the X-Y covariance that were
generally quite accurate. Their simulations also indicated that the
Mahalanobis distance measure produced less accurate matching than
subjectively weighted distance measures.

As we mentioned earlier, our own Monte Carlo study was confined
to a trivariate normal model. However, our findings were sufficiently
interesting to justify their inclusion in this thesis. In fact, some
new facts about Kadane's bias-avoiding matching strategy have already
been mentioned in Section 3.1. Suppose, then, that W = (X,Y,Z) is
trivariate normal with zero means and variance-covariance matrix

$$\Sigma = \begin{pmatrix} 1 & \rho_{xy} & \rho_{xz} \\ \rho_{xy} & 1 & \rho_{yz} \\ \rho_{xz} & \rho_{yz} & 1 \end{pmatrix}$$

(3.3.1)


Assume further that the following data are available for the purpose
of estimating the three unknown correlations in (3.3.1):

File 1: $(X_i, Z_i)$, $i = 1, 2, \ldots, n$

(3.3.2)

File 2: $(Y_j, Z_j)$, $j = n+1, \ldots, 2n$

(3.3.3)
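To make the setup concrete, the following sketch generates the two files from the trivariate normal model. It is an illustration only: the thesis used IMSL subroutines, whereas this version builds (X, Y, Z) from three independent standard normal draws, and the names `corr_det`, `draw_w` and `make_files` are hypothetical. The determinant check guards against correlation triples for which (3.3.1) is not a valid covariance matrix.

```python
import math
import random


def corr_det(r_xy, r_xz, r_yz):
    """Determinant of the 3x3 correlation matrix in (3.3.1)."""
    return 1 + 2 * r_xy * r_xz * r_yz - r_xy**2 - r_xz**2 - r_yz**2


def draw_w(rng, r_xy, r_xz, r_yz):
    """One draw of (X, Y, Z) with unit variances and the given correlations."""
    z, e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    # Partial correlation of X and Y given Z; zero under conditional independence.
    p = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))
    x = r_xz * z + math.sqrt(1 - r_xz**2) * e1
    y = r_yz * z + math.sqrt(1 - r_yz**2) * (p * e1 + math.sqrt(1 - p**2) * e2)
    return x, y, z


def make_files(n, r_xy, r_xz, r_yz, seed=0):
    """Return File 1 = [(x, z)] and File 2 = [(y, z)] from 2n independent draws."""
    if abs(r_xz) >= 1 or abs(r_yz) >= 1 or corr_det(r_xy, r_xz, r_yz) <= 0:
        raise ValueError("correlations do not define a positive definite matrix")
    rng = random.Random(seed)
    draws1 = [draw_w(rng, r_xy, r_xz, r_yz) for _ in range(n)]
    draws2 = [draw_w(rng, r_xy, r_xz, r_yz) for _ in range(n)]
    file1 = [(x, z) for x, _y, z in draws1]  # Y is dropped from File 1
    file2 = [(y, z) for _x, y, z in draws2]  # X is dropped from File 2
    return file1, file2
```

For example, `make_files(10, 0.95, 0.00, 0.10)` gives two files of ten records each, and a triple whose determinant is not positive is rejected before any data are drawn.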


In view of the discussions in Section 3.2, if the conditional
independence assumption $X \perp\!\!\!\perp Y \mid Z$ or, equivalently,

$$\rho_{xy} = \rho_{xz}\rho_{yz}$$

(3.3.4)

were true, then we could avoid merging the files in (3.3.2) and (3.3.3),
because File 1 and File 2 can be used to get the sample correlations
$\hat\rho_{xz}$ and $\hat\rho_{yz}$, which in turn provide the maximum
likelihood estimator of $\rho_{xy}$, namely

$$\hat\rho_{xy} = \hat\rho_{xz}\hat\rho_{yz}$$

(3.3.5)
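The estimator in (3.3.5) needs nothing more than one sample correlation from each file. A minimal sketch, assuming each file is a list of pairs as above; the helper names `pearson` and `two_sample_rho_xy` are illustrative only.

```python
import math
from statistics import fmean


def pearson(u, v):
    """Sample correlation coefficient of two equal-length sequences."""
    ub, vb = fmean(u), fmean(v)
    suv = sum((a - ub) * (b - vb) for a, b in zip(u, v))
    suu = sum((a - ub) ** 2 for a in u)
    svv = sum((b - vb) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def two_sample_rho_xy(file1, file2):
    """Product of the within-file correlations, as in (3.3.5).

    file1 is a list of (x, z) pairs and file2 a list of (y, z) pairs.
    """
    r_xz = pearson([x for x, _ in file1], [z for _, z in file1])
    r_yz = pearson([y for y, _ in file2], [z for _, z in file2])
    return r_xz * r_yz
```

Note that the two files are never joined: the records of File 1 and File 2 only enter through their own correlation with Z.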


We shall say X and Y are conditionally dependent, given Z, iff
(3.3.4) does not hold; that is,

$$\rho_{xy} \neq \rho_{xz}\rho_{yz}$$

For the sake of simplicity, we shall consider hereinafter only the
conditional positive dependence case of the model in (3.3.1), namely

$$\rho_{xy} > \rho_{xz}\rho_{yz}$$

(3.3.6)

The complementary case of conditional negative dependence, namely

$$\rho_{xy} < \rho_{xz}\rho_{yz}$$

can, however, be handled by methods similar to ours. We shall also
include the case when $X \perp\!\!\!\perp Y \mid Z$ holds, mainly for
comparing and contrasting it with our results for the positive
dependence case. Finally, we shall evaluate matching strategies only
from the point of view of estimating $\rho_{xy}$, the correlation
between variables which are not in the same file, because File 1 and
File 2 can respectively be used to estimate the remaining parameters
$\rho_{xz}$ and $\rho_{yz}$.

It is clear that, if the condition $X \perp\!\!\!\perp Y \mid Z$ does
not hold, then we should not estimate $\rho_{xy}$ by means of (3.3.5).
In such a case, matching the files (3.3.2) and (3.3.3) for estimation
purposes is an alternative that we shall study in this section. Thus,
if after merging, File 1 becomes the synthetic File 1, namely

$$(X_i, Y_i^*, Z_i), \quad i = 1, 2, \ldots, n$$

(3.3.7)

where $Y_i^*$ is the value of Y assigned to the i-th record in the
process of merging, then we shall use the synthetic data
$(X_i, Y_i^*)$, $i = 1, 2, \ldots, n$, to estimate $\rho_{xy}$.

It was mentioned in Section 1.7 that performance characteristics,
which can help us assess the reliability of synthetic data
generated from the independent files in (3.3.2), are not known. Given
this paucity, our program for an empirical evaluation of matching
strategies is as follows:


(i) Starting with a known correlation matrix given by (3.3.1),
generate data from the normal population of W = (X,Y,Z) and
create independent files (3.3.2) and (3.3.3). Note that data
on (X,Y), which are typically missing in actual matching
situations, are available in simulation studies.

(ii) Using any given matching strategy, merge the two files created
in Step (i) and compute the "synthetic correlation", denoted
by $\hat\rho_s$, which is defined to be the sample correlation
coefficient based on the $(X, Y^*)$ data given by the synthetic
file (3.1.7).

(iii) Compare $\hat\rho_s$ of Step (ii) with the following sample
correlations:

(a) $\hat\rho_{m\ell 1}$, the sample correlation coefficient based on
the unbroken data $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, which was
generated in Step (i). Observe that, if there is no a priori
restriction on the model parameters in (3.3.1), then
$\hat\rho_{m\ell 1}$ is the maximum likelihood estimator of
$\rho_{xy}$.

(b) $\hat\rho_{m\ell 2}$, the estimator of $\rho_{xy}$ given by
(3.3.5), which is also the maximum likelihood estimator of
$\rho_{xy}$ when conditional independence holds.

Because $\hat\rho_{m\ell 1}$ and $\hat\rho_{m\ell 2}$ are respectively
based on one sample on (X,Y) and two independent samples on (X,Z) and
(Y,Z), we shall also refer to these as one-sample and two-sample
estimates of $\rho_{xy}$.
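The three steps above can be sketched as a small simulation loop. This is a hedged illustration, not the thesis's program: it uses Python's `random` module rather than IMSL, it implements only one matching strategy (pairing records by the rank of Z in each file, which is how the isotonic strategy is read here), and it uses the n−1 divisor for the standard deviation. All function names are hypothetical.

```python
import math
import random
from statistics import fmean, stdev


def pearson(u, v):
    """Sample correlation coefficient."""
    ub, vb = fmean(u), fmean(v)
    suv = sum((a - ub) * (b - vb) for a, b in zip(u, v))
    suu = sum((a - ub) ** 2 for a in u)
    svv = sum((b - vb) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def draw_w(rng, r_xy, r_xz, r_yz):
    """One draw of (X, Y, Z) from the trivariate normal model (3.3.1)."""
    z, e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    p = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))
    x = r_xz * z + math.sqrt(1 - r_xz**2) * e1
    y = r_yz * z + math.sqrt(1 - r_yz**2) * (p * e1 + math.sqrt(1 - p**2) * e2)
    return x, y, z


def one_replication(rng, n, r_xy, r_xz, r_yz):
    # Step (i): 2n independent draws; the first n keep their true Y for reference.
    x1, y1, z1 = zip(*[draw_w(rng, r_xy, r_xz, r_yz) for _ in range(n)])
    _, y2, z2 = zip(*[draw_w(rng, r_xy, r_xz, r_yz) for _ in range(n)])
    # Step (ii): pair the records with equal rank of Z in the two files.
    order1 = sorted(range(n), key=lambda i: z1[i])
    order2 = sorted(range(n), key=lambda j: z2[j])
    y_star = [0.0] * n
    for i, j in zip(order1, order2):
        y_star[i] = y2[j]
    rho_s = pearson(x1, y_star)
    # Step (iii): the one-sample and two-sample reference estimates.
    rho_ml1 = pearson(x1, y1)
    rho_ml2 = pearson(x1, z1) * pearson(y2, z2)
    return rho_ml1, rho_ml2, rho_s


def summarize(values):
    """Mean, standard deviation, minimum and maximum of the replications."""
    return fmean(values), stdev(values), min(values), max(values)


def monte_carlo(reps, n, r_xy, r_xz, r_yz, seed=0):
    rng = random.Random(seed)
    runs = [one_replication(rng, n, r_xy, r_xz, r_yz) for _ in range(reps)]
    names = ("ml1", "ml2", "s1")
    return {name: summarize([r[k] for r in runs]) for k, name in enumerate(names)}
```

A call such as `monte_carlo(200, 10, 0.95, 0.00, 0.10)` returns, for each estimator, the four summary statistics in the same order as the table entries (mean, standard deviation, minimum, maximum).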

Using the aforementioned program, we shall evaluate Kadane's
distance-based matching strategies discussed in Section 3.1, namely
the isotonic matching strategy and the procedure induced by the
Mahalanobis distance, and the method of matching in bins, which, as
explained in Subsection 3.1.?, is an adaptation of a strategy due to
Sims (1978). The synthetic correlations resulting from the use of
these three strategies will be denoted by $\hat\rho_{s1}$,
$\hat\rho_{s2}$ and $\hat\rho_{s3}$, respectively.

Our study has been conducted for three values of n, namely 10,
25 and 50. The values of the population correlation $\rho_{xy}$, which
are used, among others, to generate random deviates from the normal
population of W = (X,Y,Z), were chosen from the following categories:

Low $\rho_{xy}$:     0.00, 0.25

Medium $\rho_{xy}$:  0.50, 0.60, 0.65, 0.70

High $\rho_{xy}$:    0.75 (0.05) 0.95, 0.99

(3.3.8)

Combined with low as well as high values of $\rho_{xz}$ and
$\rho_{yz}$, there were 15 choices of $\rho_{xy}$ from (3.3.8) such
that the conditional independence restriction (3.3.5) was satisfied.
As remarked earlier, these correlations were chosen mainly to provide
a basis such that the estimates of $\rho_{xy}$ resulting from the case
of conditional positive dependence can be compared with those
resulting from conditional independence. The fifteen values of
$\rho_{xy}$ in the conditional independence case were increased in
such a way that positive dependence was achieved. Altogether, nineteen
such $\Sigma$'s were selected.

For n = 10, W was generated 1000 times by using the IMSL
subroutines. The calculation of $\hat\rho_{s1}$ was based on sorting
the Z's in the two files, as discussed in Section 3.1.1. Furthermore,
$\hat\rho_{s2}$ was computed for each realization by solving a linear
assignment problem. The Ford-Fulkerson algorithm (Zionts, 1974) was
used for this purpose. The computational cost for solving assignment
problems grew quite rapidly with n. Therefore, only 700 independent
samples of size n = 25 were generated. A comprehensive examination of
the results for n = 10, 25 revealed that $\hat\rho_{s1}$ and
$\hat\rho_{s2}$, the correlations corresponding to Kadane's two
distance measures, were, for all practical purposes, identical (see
Figures 3.1 and 3.2). In view of this and the high computational
costs, we compared only two strategies, the isotonic and the method of
matching in bins, for n = 50 (2500 independent samples).
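One way to see why a sorting-based match and an assignment-based match can coincide is the following small check. It is an illustration under an assumption that is not taken from the thesis: that, for a scalar Z, the matching distance reduces to the squared difference of the Z values. Under that assumption, pairing records by rank minimizes the total assignment cost. The brute-force search over permutations is for demonstration on tiny files only and is not the Ford-Fulkerson procedure used in the study.

```python
from itertools import permutations


def assignment_cost(z1, z2, perm):
    """Total squared Z-distance when record i of File 1 gets record perm[i] of File 2."""
    return sum((z1[i] - z2[j]) ** 2 for i, j in enumerate(perm))


def best_assignment(z1, z2):
    """Exhaustive search over all one-to-one assignments (tiny n only)."""
    return min(permutations(range(len(z1))),
               key=lambda p: assignment_cost(z1, z2, p))


def sorted_assignment(z1, z2):
    """Pair the k-th smallest Z in File 1 with the k-th smallest Z in File 2."""
    order1 = sorted(range(len(z1)), key=lambda i: z1[i])
    order2 = sorted(range(len(z2)), key=lambda j: z2[j])
    perm = [0] * len(z1)
    for i, j in zip(order1, order2):
        perm[i] = j
    return tuple(perm)


z_file1 = [0.3, -1.2, 2.0, 0.9]
z_file2 = [1.1, 0.2, -0.7, 1.8]
print(sorted_assignment(z_file1, z_file2))
print(best_assignment(z_file1, z_file2))
```

The exhaustive search visits n! assignments, which is a reminder of how quickly a naive treatment of the assignment problem becomes infeasible as n grows.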

Four summary statistics, namely the mean, the standard
deviation, the minimum and the maximum, for the simulated data on
$\hat\rho_{m\ell 1}$, $\hat\rho_{m\ell 2}$, $\hat\rho_{s1}$,
$\hat\rho_{s2}$ and $\hat\rho_{s3}$ were calculated for the 34
$\Sigma$'s selected for the study. However, we provide these
statistics only for a representative collection of 15 $\Sigma$'s in
Tables 3.1 to 3.7. For each $\Sigma$ and for any $\hat\rho$, the first
entry in the tables is the mean, the second entry (in parentheses) is
the standard deviation and the third and the fourth entries are
respectively the minimum and the maximum.
Figures 3.1 to 3.20 provide an illustration of these comparisons.

3.3.1 Conclusions of the Monte Carlo Study

Tables 3.1 to 3.4 clearly show that the two estimates
$\hat\rho_{s1}$ and $\hat\rho_{s2}$, provided by the isotonic matching
strategy and the Mahalanobis-distance based strategy, respectively,
have nearly identical summary statistics. In fact, an examination of
all the results showed that, for all values of n and $\Sigma$ in our
study, the estimates $\hat\rho_{s1}$ and $\hat\rho_{s2}$ were the same
for most of the realizations of W. Figures 3.1 and 3.2 provide the
empirical evidence of this fact.

Now we shall discuss our results in the case of conditional
independence. As noted in Section 3.2, $\hat\rho_{m\ell 2}$ is the
maximum likelihood estimator of $\rho_{xy}$ under this model, whereas
$\hat\rho_{m\ell 1}$, the method of moments estimator based on paired
data, is computed for comparison purposes. As expected,
$\hat\rho_{m\ell 1}$ and $\hat\rho_{m\ell 2}$ behave equally well on
the average, even though the estimated standard error of
$\hat\rho_{m\ell 1}$ is consistently higher than that of
$\hat\rho_{m\ell 2}$. Furthermore, the ranges of $\hat\rho_{m\ell 1}$
are consistently larger than those of $\hat\rho_{m\ell 2}$ (see Tables
3.1, 3.3 and 3.5).

For low correlation and each n, $\hat\rho_{s1}$, $\hat\rho_{s2}$ and
$\hat\rho_{s3}$ compare well with the estimates $\hat\rho_{m\ell 1}$ or
$\hat\rho_{m\ell 2}$ as far as the averages are concerned (see Tables
3.1, 3.3 and 3.5). However, the synthetic data estimators have larger
variation than $\hat\rho_{m\ell 2}$, as shown in Fig. 3.3 - Fig. 3.5.
Furthermore, all the synthetic data estimators have variation
comparable to that of $\hat\rho_{m\ell 1}$, as shown in Fig. 3.6 -
Fig. 3.8.

For medium and high values of $\rho_{xy}$, all three synthetic
estimators exhibit some amount of negative bias with regard to both
$\hat\rho_{m\ell 1}$ and $\hat\rho_{m\ell 2}$. Also, $\hat\rho_{s3}$,
the estimator given by the method of matching in bins, is more
negatively biased than $\hat\rho_{s1}$ and $\hat\rho_{s2}$ (see Tables
3.1, 3.3 and 3.5). Fig. 3.9 - Fig. 3.14 illustrate these points. Again,
$\hat\rho_{s3}$ is worse than $\hat\rho_{s1}$ and $\hat\rho_{s2}$.
These patterns among the five estimates exist for any sample size, even
though the difference between the synthetic data estimators and
$\hat\rho_{m\ell 2}$ tends to decrease as n increases.

Turning to the conditional positive dependence case, we first
note that $\hat\rho_{m\ell 1}$ is a reasonable estimator of
$\rho_{xy}$, even though it would not be available to the
practitioner. On comparing $\hat\rho_{m\ell 1}$ with the synthetic
data estimators $\hat\rho_{s1}$, $\hat\rho_{s2}$ and $\hat\rho_{s3}$
and with $\hat\rho_{m\ell 2}$, we find that these estimators perform
very badly, in that all of them are negatively biased. Moreover,
the three synthetic data estimators have a definite negative bias
compared with $\hat\rho_{m\ell 2}$. Tables 3.2, 3.4, 3.6 and 3.7 and
Fig. 3.16 - Fig. 3.19 support this conclusion. Furthermore, it is
observed that $\hat\rho_{s3}$, based on binning, is worse than
$\hat\rho_{s1}$ ($\hat\rho_{s2}$), as illustrated by Fig. 3.20.
However, the difference between the average $\hat\rho_{m\ell 2}$ and
$\hat\rho_{si}$, $i = 1,2,3$, tends to decrease as n increases.

Finally, it must be pointed out that as the positive dependence
increases, i.e., as $\rho_{xy} - \rho_{xz}\rho_{yz}$ increases, the
bias in the three synthetic data estimators and $\hat\rho_{m\ell 2}$
increases. Tables 3.4 and 3.7 illustrate this fact.
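The size of the departure from conditional independence is easy to tabulate. The sketch below computes $\rho_{xy} - \rho_{xz}\rho_{yz}$ for the parameter triples read from the positive dependence table for n = 10. Since sample correlations converge to their population values, the product $\rho_{xz}\rho_{yz}$ is the large-sample target of the two-sample estimator (3.3.5), so the gap is also the amount by which that estimator must eventually fall short of $\rho_{xy}$. The function name `ci_gap` is illustrative.

```python
# (rho_xz, rho_yz, rho_xy) triples as listed in the positive dependence tables.
configs = [
    (0.00, 0.10, 0.95),
    (0.92, 0.65, 0.88),
    (0.93, 0.75, 0.92),
    (0.94, 0.85, 0.96),
    (0.95, 0.95, 0.98),
    (0.97, 0.97, 0.99),
]


def ci_gap(r_xz, r_yz, r_xy):
    """Departure from conditional independence: rho_xy - rho_xz * rho_yz."""
    return r_xy - r_xz * r_yz


for r_xz, r_yz, r_xy in configs:
    limit = r_xz * r_yz  # large-sample value of the two-sample estimator
    print(f"rho_xy={r_xy:.2f}  rho_xz*rho_yz={limit:.4f}  "
          f"gap={ci_gap(r_xz, r_yz, r_xy):.4f}")
```

For the second triple, for instance, the product is 0.92 × 0.65 = 0.598, so the gap is 0.282.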

Based on these observations, we must conclude that when the
conditional independence model holds, the synthetic data estimators
do not provide any advantage over $\hat\rho_{m\ell 2}$, the
no-matching estimator. In fact, they are slightly worse than
$\hat\rho_{m\ell 2}$. On the other hand, in the case of conditional
positive dependence, $\hat\rho_{m\ell 2}$ and all the synthetic data
estimators perform badly, the performance of the synthetic data
estimators being slightly worse than that of $\hat\rho_{m\ell 2}$.
Thus estimators based on matching strategies do not seem to provide
any advantage over the estimators based on the assumption of
conditional independence and no matching. For estimating $\rho_{xy}$
in Case III models, therefore, the extra work involved in matching
data files is almost worthless. Further studies are in order for much
larger sample sizes to examine if this picture changes at all. We
should point out that it is possible that matching may be useful for
extracting some other features of the joint distribution, and further
Monte Carlo studies are warranted to explore this.


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Table  3.1  Summary  Statistics  of  Sample 
Correlations - Files with n=10 Records
Conditional Independence Case

rho_xz  rho_yz  rho_xy  Statistic  rho_ml1    rho_ml2    rho_s1     rho_s2     rho_s3
0.00    0.10    0.00    Mean        0.0149    -0.0032    -0.0101     0.0100     0.0114
                        (SD)       (0.3384)   (0.1127)   (0.3296)   (0.3297)   (0.3212)
                        Min        -0.8170    -0.6844    -0.7676    -0.7676     0.8606
                        Max         0.8472     0.4675     0.8590     0.8590     0.7708

0.92    0.65    0.60    Mean        0.5879     0.5794     0.5457     0.5467     0.5105
                        (SD)       (0.2212)   (0.2006)   (0.2337)   (0.2337)   (0.2396)
                        Min        -0.6523    -0.4040    -0.6058    -0.6058    -0.6058
                        Max         0.9753     0.9431     0.9626     0.9626     0.9681

0.93    0.75    0.70    Mean        0.6830     0.6638     0.6150     0.6151     0.5748
                        (SD)       (0.1986)   (0.1728)   (0.2087)   (0.2086)   (0.2230)
                        Min        -0.3369    -0.1437    -0.3115    -0.3115     0.3396
                        Max         0.9936     0.9609     0.9576     0.9576     0.9696

0.94    0.85    0.80    Mean        0.7863     0.7775     0.7302     0.7302     0.6874
                        (SD)       (0.1445)   (0.1182)   (0.1522)   (0.1522)   (0.1731)
                        Min        -0.3432     0.2058    -0.2367    -0.2367    -0.2367
                        Max         0.9879     0.9566     0.9799     0.9799     0.9723

0.95    0.95    0.90    Mean        0.8937     0.8901     0.8252     0.8251     0.7789
                        (SD)       (0.0764)   (0.0625)   (0.0994)   (0.0995)   (0.1236)
                        Min         0.3247     0.3508     0.3821     0.3821     0.1796
                        Max         0.9949     0.9814     0.9850     0.9850     0.9725

0.97    0.97    0.95    Mean        0.9448     0.9421     0.8758     0.8760     0.8238
                        (SD)       (0.0419)   (0.0317)   (0.0741)   (0.0741)   (0.1063)
                        Min         0.5329     0.7364     0.5027     0.5027     0.2123
                        Max         0.9973     0.9910     0.9898     0.9898     0.9868
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Table  3.2  Summary  Statistics  of  Sample 
Correlations - Files with n=10 Records
Conditional Positive Dependence Case

rho_xz  rho_yz  rho_xy  Statistic  rho_ml1    rho_ml2    rho_s1     rho_s2     rho_s3
0.00    0.10    0.95    Mean        0.9413    -0.0046    -0.0289     0.0395     0.0153
                        (SD)       (0.0474)   (0.1142)   (0.3310)   (0.3327)   (0.3269)
                        Min         0.5942    -0.5723    -0.8425    -0.8525     0.8962
                        Max         0.9959     0.5302     0.8897     0.8897     0.8181

0.92    0.65    0.88    Mean        0.8676     0.5729     0.5276     0.5108     0.4919
                        (SD)       (0.0885)   (0.2021)   (0.2403)   (0.2443)   (0.2483)
                        Min         0.2744    -0.5510    -0.6166     0.6248    -0.6119
                        Max         0.9914     0.9407     0.9621     0.9621     0.9621

0.93    0.75    0.92    Mean        0.9103     0.6771     0.6310     0.6262     0.5834
                        (SD)       (0.0666)   (0.1617)   (0.2018)   (0.2050)   (0.2085)
                        Min         0.4811    -0.2063    -0.3529    -0.3529    -0.2667
                        Max         0.9918     0.9448     0.9722     0.9722     0.9892

0.94    0.85    0.96    Mean        0.9558     0.7741     0.7188     0.7165     0.6687
                        (SD)       (0.0353)   (0.1153)   (0.1573)   (0.1578)   (0.1781)
                        Min         0.6288     0.2202    -0.2325    -0.2325    -0.1806
                        Max         0.9960     0.9798     0.9707     0.9707     0.9535

0.95    0.95    0.98    Mean        0.9775     0.8871     0.8225     0.8211     0.7770
                        (SD)       (0.0177)   (0.0640)   (0.1036)   (0.1040)   (0.1231)
                        Min         0.8491     0.4165     0.2546     0.2546     0.0215
                        Max         0.9986     0.9783     0.9922     0.9922     0.9727

0.97    0.97    0.99    Mean        0.9888     0.9439     0.8770     0.8774     0.8258
                        (SD)       (0.0088)   (0.0329)   (0.0760)   (0.0755)   (0.1039)
                        Min         0.9184     0.6081     0.4432     0.4432     0.3541
                        Max         0.9992     0.9919     0.9894     0.9894     0.9857
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Table  3.3  Summary  Statistics  of  Sample 
Correlations - Files with n=25 Records
Conditional Independence Case

rho_xz  rho_yz  rho_xy  Statistic  rho_ml1    rho_ml2    rho_s1     rho_s2     rho_s3
0.00    0.10    0.00    Mean       -0.0068     0.0001     0.0025     0.0026     0.0040
                        (SD)       (0.2059)   (0.0479)   (0.2013)   (0.2014)   (0.2008)
                        Min        -0.6576    -0.2851     0.5749     0.5749     0.6980
                        Max         0.5450     0.2501     0.6196     0.6196     0.5087

0.92    0.65    0.60    Mean        0.5915     0.5788     0.5568     0.5564     0.5171
                        (SD)       (0.1336)   (0.1231)   (0.1365)   (0.1365)   (0.1476)
                        Min        -0.0576    -0.0890     0.0259     0.0259    -0.0468
                        Max         0.8704     0.8189     0.8663     0.8663     0.8096

0.93    0.75    0.70    Mean        0.6859     0.6859     0.6620     0.6627     0.6111
                        (SD)       (0.1087)   (0.0935)   (0.1096)   (0.1097)   (0.1216)
                        Min         0.2953     0.2697     0.1828     0.1828     0.1642
                        Max         0.9022     0.8959     0.8955     0.8955     0.8973

0.94    0.85    0.80    Mean        0.7993     0.7934     0.7644     0.7643     0.7129
                        (SD)       (0.0754)   (0.0617)   (0.0789)   (0.0790)   (0.0964)
                        Min         0.4274     0.4778     0.4617     0.4617     0.2724
                        Max         0.9380     0.9087     0.9139     0.9139     0.9241

0.95    0.95    0.90    Mean        0.8967     0.8961     0.8648     0.8643     0.8049
                        (SD)       (0.0416)   (0.0313)   (0.0473)   (0.476)    (0.0676)
                        Min         0.7057     0.7592     0.6580     0.6580     0.4614
                        Max         0.9753     0.9636     0.9632     0.9632     0.9297

0.97    0.97    0.95    Mean        0.9479     0.9473     0.9117     0.9123     0.8485
                        (SD)       (0.0211)   (0.0154)   (0.0327)   (0.0326)   (0.0605)
                        Min         0.8446     0.8638     0.7636     0.7636     0.5102
                        Max         0.9874     0.9755     0.9735     0.9735     0.9519
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Table  3.4  Summary  Statistics  of  Sample 
Correlations - Files with n=25 Records
Conditional Positive Dependence Case

rho_xz  rho_yz  rho_xy  Statistic  rho_ml1    rho_ml2    rho_s1     rho_s2     rho_s3
0.00    0.10    0.95    Mean        0.9475    -0.0019     0.0058    -0.0372    -0.0004
                        (SD)       (0.0222)   (0.0439)   (0.2061)   (0.2038)   (0.1989)
                        Min         0.8249    -0.2817    -0.5665    -0.5480     0.7596
                        Max         0.9857     0.1963     0.6964     0.6964     0.5557

0.92    0.65    0.88    Mean        0.8758     0.5857     0.5643     0.5149     0.5277
                        (SD)       (0.0503)   (0.1207)   (0.1331)   (0.1436)   (0.1425)
                        Min         0.6051     0.1442     0.1621     0.0617     0.0404
                        Max         0.9738     0.8344     0.8896     0.8896     0.8512

0.93    0.75    0.92    Mean        0.9143     0.6907     0.6627     0.6489     0.6190
                        (SD)       (0.0361)   (0.0851)   (0.1058)   (0.1093)   (0.1125)
                        Min         0.6844     0.2967     0.2949     0.2641     0.1829
                        Max         0.9774     0.8876     0.8661     0.8642     0.9020

                        Mean        0.9578     0.7931     0.7641     0.7539     0.7127
                        (SD)       (0.0174)   (0.0624)   (0.0832)   (0.0853)   (0.0948)
                        Min         0.8756     0.5449     0.3612     0.3647     0.3425
                        Max         0.9893     0.9226     0.9181     0.9174     0.9128

                        Mean        0.9792     0.8956     0.8614     0.8543     0.7998
                        (SD)       (0.0096)   (0.0308)   (0.0496)   (0.0516)   (0.0691)
                        Min         0.9131     0.7693     0.6315     0.6226     0.5157
                        Max         0.9959     0.9661     0.9647     0.9647     0.9413

                        Mean        0.9895     0.9475     0.9123     0.9139     0.8499
                        (SD)       (0.0042)   (0.0158)   (0.0339)   (0.0336)   (0.0584)
                        Min         0.9685     0.8769     0.7182     0.7352     0.5685
                        Max         0.9972     0.9833     0.9769     0.9849     0.9773

Table 3.5 Summary Statistics of Sample
Correlations - Files with n=50 Records
Conditional Independence Case

rho_xz  rho_yz  rho_xy  Statistic  rho_ml1    rho_ml2    rho_s1     rho_s3
        0.10    0.00    Mean       -0.0004    -0.0003    -0.0019    -0.0044
                        (SD)       (0.1436)   (0.0242)   (0.1474)   (0.1445)
                        Min        -0.4381    -0.1663    -0.4872    -0.5205
                        Max         0.4746     0.1244     0.4398     0.4574

        0.65    0.60    Mean        0.5936     0.5952     0.5823     0.5391
                        (SD)       (0.0916)   (0.0794)   (0.0909)   (0.0959)
                        Min         0.2530     0.2219     0.2242     0.1098
                        Max         0.8377     0.8103     0.7998     0.7873

        0.75    0.70    Mean        0.6950     0.6953     0.6807     0.6279
                        (SD)       (0.0756)   (0.0612)   (0.0709)   (0.0815)
                        Min         0.2796     0.3696     0.3760     0.2526
                        Max         0.8768     0.8426     0.8718     0.8543

0.94    0.85    0.80    Mean        0.7959     0.7974     0.7797     0.7198
                        (SD)       (0.0528)   (0.0408)   (0.0527)   (0.0645)
                        Min         0.5689     0.5664     0.4919     0.4531
                        Max         0.9204     0.9082     0.9222     0.8821

0.95    0.95    0.90    Mean        0.8982     0.8978     0.8778     0.8110
                        (SD)       (0.0289)   (0.0200)   (0.0306)   (0.0493)
                        Min         0.7152     0.7845     0.7331     0.6079
                        Max         0.9634     0.9467     0.9595     0.9149

0.97    0.97    0.95    Mean        0.9486     0.9490     0.9276     0.8559
                        (SD)       (0.0151)   (0.0103)   (0.0199)   (0.0419)
                        Min         0.8549     0.9100     0.8039     0.6529
                        Max         0.9808     0.9743     0.9761     0.9576
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Table  3.6  Summary  Statistics  of  Sample 
Correlations - Files with n=50 Records
Conditional Positive Dependence Case

rho_xz  rho_yz  rho_xy  Statistic  rho_ml1    rho_ml2    rho_s1     rho_s3
0.00    0.10    0.95    Mean        0.9491     0.0001     0.0015     0.0025
                        (SD)       (0.0148)   (0.0245)   (0.1475)   (0.1427)
                        Min         0.8700    -0.1447    -0.5256    -0.5157
                        Max         0.9828     0.1506     0.4727     0.5145

0.92    0.65    0.88    Mean        0.8776     0.5934     0.5809     0.5358
                        (SD)       (0.0336)   (0.0817)   (0.0928)   (0.0981)
                        Min         0.6908     0.2791     0.1519     0.1593
                        Max         0.9576     0.8031     0.8181     0.8338

0.93    0.75    0.92    Mean        0.9183     0.6944     0.6771     0.6257
                        (SD)       (0.0225)   (0.0638)   (0.0752)   (0.0834)
                        Min         0.8119     0.4028     0.3506     0.2950
                        Max         0.9698     0.8628     0.8599     0.8595

                        Mean        0.9595     0.7967     0.7803     0.7198
                        (SD)       (0.0116)   (0.0415)   (0.0512)   (0.0627)
                        Min         0.8793     0.6023     0.5699     0.3595
                        Max         0.9853     0.8960     0.9158     0.8824

                        Mean        0.9794     0.8973     0.8776     0.8106
                        (SD)       (0.0061)   (0.0200)   (0.0294)   (0.0468)
                        Min         0.9390     0.8096     0.7596     0.6273
                        Max         0.9932     0.9506     0.9570     0.9279

                        Mean        0.9898     0.9492     0.9281     0.8555
                        (SD)       (0.0029)   (0.0107)   (0.0200)   (0.0426)
                        Min         0.9736     0.8927     0.8181     0.6501
                        Max         0.9964     0.9757     0.9713     0.9555


Table 3.7 Summary Statistics of Sample
Correlations - Files with n=25 Records
Conditional Positive Dependence Case

rho_xz  rho_yz  rho_xy  Statistic  rho_ml1    rho_ml2    rho_s1     rho_s2     rho_s3
                        Mean        0.4933     0.0008    -0.0027    -0.0063     0.0012
                        (SD)       (0.1574)   (0.0451)   (0.2117)   (0.2105)   (0.2044)
                        Min        -0.0632    -0.1632    -0.6421    -0.6421    -0.0035
                        Max         0.8777     0.1976     0.6186    -0.6186     0.5807

                        Mean        0.7425     0.5876     0.5655     0.5622     0.5236
                        (SD)       (0.0940)   (0.1108)   (0.1292)   (0.1301)   (0.1430)
                        Min         0.2986     0.1141    -0.0065    -0.0065     0.0205
                        Max         0.9390     0.8326     0.8621    -0.8621     0.8285

                        Mean        0.7943     0.6919     0.6683     0.6691     0.6249
                        (SD)       (0.0762)   (0.0889)   (0.1109)   (0.1102)   (0.1180)
                        Min         0.3982     0.3129     0.1844     0.1844     0.2023
                        Max         0.9373     0.8978     0.9047     0.9047     0.8853
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Figure  3.3  Isotonic  vs.  Nomatching. 
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Figure  3.6  Isotonic  vs.  Nomatching. 

$\rho_{xz}$ = 0.00, $\rho_{yz}$ = 0.10, $\rho_{xy}$ = 0.00, n = 10.

Figure 3.11 Matching in Bins vs. Nomatching.

$\rho_{xz}$ = 0.93, $\rho_{yz}$ = 0.75, $\rho_{xy}$ = 0.70, n = 25.

Figure 3.13 Mahalanobis vs. Nomatching.

$\rho_{xz}$ = 0.93, $\rho_{yz}$ = 0.75, $\rho_{xy}$ = 0.70, n = 25.

Figure 3.14 Matching in Bins vs. Nomatching.

$\rho_{xz}$ = 0.93, $\rho_{yz}$ = 0.75, $\rho_{xy}$ = 0.70, n = 25.
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halanobls  vs.  Nomatching. 

2  =  0.00,  Pyz  =  0.10,  pXy  *  0.05,  n 
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Figure  3.18  Isotonic  vs.  Nomatching. 

$\rho_{xz}$ = 0.94, $\rho_{yz}$ = 0.85, $\rho_{xy}$ = 0.96, n = 50.
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